INTRODUCTION



According to Putnam to talk of “facts” without specifying the language to be used is to 
talk of nothing; “object” itself has many uses and as we creatively invent new uses of 
words “we find that we can speak of ‘objects' that were not ‘values of any variable’ in 
any language we previously spoke” 1 . The notion of object becomes, then, like the notion 
of reference, a sort of open land, an unknown territory. The exploration of this land ap- 
pears to be constrained by use and invention. But, we may wonder, is it possible to guide 
invention and control use? In what way, in particular, is it possible, at the level of natu- 
ral language, to link together program expressions and natural evolution? 

To give an answer to these onerous questions we should immediately point out that 
cognition (as well as natural language) has to be considered first of all as a peculiar func- 
tion of active biosystems and that it results from complex interactions between the or- 
ganism and its surroundings. “In the moment an organism perceives an object of what- 
ever kind, it immediately begins to ‘interpret’ this object in order to react properly to it 
... It is not necessary for the monkey to perceive the tree in itself... What counts is sur- 
vival” 2 . 

In this sense, if it is clearly true that we cannot talk of facts without specifying the 
language to be used, it is also true that we cannot perceive objects (and their relations) 
if we are unable continuously to “reconstruct” and invent new uses of the cognitive 
tools at work at the level of visual cognition. As Ruse remarks, the real world cer- 
tainly exists, but it is the world as we interpret it. In what way, however, can we in- 
terpret it adequately? How can we manage to “feel” that we are interpreting it ac- 
cording to the truth (albeit partially)? In other words, if perceiving and, from a more 
general point of view, knowing is interpreting, and if the interplay existing between a 
complex organism and its environment is determined by the compositio of interpreta- 
tive functions, actual emergence of meaning, and evolution, in what way can humans 
describe the inner articulation of this mysterious interplay, mirroring themselves in 
this compositio? Does this very compositio possess a “transcendental” character?
How, generally speaking, can our brains give rise to our minds? 3 What types of functions
and rules should we identify (and invent) in order to describe and, at the same time,
construct those evolutionary paths of cognitive processes (and in particular of visual
cognition) that progressively constitute the very fibres of our being? How can
we model that peculiar texture of life and knowledge in flux that characterises our 
mental activity? 

In order to do this we need, as is evident, ever-new simulation models considered in turn, 
from an abstract point of view, as specific mathematical objects. These models should ad- 
equately represent specific systems capable of autonomously self-organising in response to 
a rapidly changing and unpredictable world. In other words, we are really faced with the 
necessary outlining of models able to explain (and unfold) behavioural and brain data and 
contemporarily to interact with the dynamics that they try to describe in order to “prime”
new patterns of possible integration. These models, once established, may act as operative 
channels and may be exactly studied as mathematical objects, i.e., they articulate as self- 
organising models. When we perceive and “interpret”, our mind actually constructs mod- 
els so that the brain can be forged and guided in its activity of reading the world. 

In this sense, informational data must be considered as emergent properties of a dy- 
namical process taking place at the level of the mind. From a general point of view, be- 
haviour must be understood as an ensemble of emergent properties of networks of neu- 
rones. At the same time, the neurones and their interactions are necessary in order to cor- 
rectly define the laws proper to the networks whose emergent properties map onto be- 
haviour. Thus, on the one hand, we need a powerful theoretical language: the language, 
in particular, of dynamical systems and, on the other, we contemporarily need self-or- 
ganising models able to draw, fix and unfold the link between real emergence and men- 
tal construction as well as the link between the holistic fabric of perception and the step 
by step developmental processes in action. 

The chapter by S. Grossberg “Neural Models of Seeing and Thinking” aims at a very 
clear exploration of the role played by the neural models at the level of Cognitive Sci- 
ence and in particular at the level of visual cognition. According to Grossberg the brain 
is organised in parallel processing streams. These streams are not independent modules 
however; as a matter of fact strong interactions occur between perceptual qualities. “A 
great deal of theoretical and experimental evidence suggests that the brain’s processing 
streams compute complementary properties. Each stream’s properties are related to those 
of a complementary stream, much as a lock fits its key or two pieces of a puzzle fit to- 
gether” 4 . “How, then, do these complementary properties get synthesised into a consis- 
tent behavioural experience? It is proposed that interactions between these processing 
streams overcome their complementary deficiencies and generate behavioural properties 
that realize the unity of conscious experiences. In this sense, pairs of complementary 
streams are the functional units, because only through their interactions can key behav- 
ioural properties be competently computed. These interactions may be used to explain 
many of the ways in which perceptual qualities are known to influence each other” 5 . 

Each stream can possess multiple processing stages, a fact which, according to Gross- 
berg, suggests that these stages realize a process of hierarchical resolution of uncertainty. 
The computational unit is thus not a single processing stage but a minimal set of processing
stages that interact within and between complementary processing streams. “The
brain thus appears as a self-organising measuring device in the world and of the world” 6 . 

The neural theory FACADE illustrated by Grossberg in his article suggests how and 
why perceptual boundaries and perceptual surfaces compute complementary properties. 
In particular, Grossberg, using the famous Kanizsa square, shows that a percept is due to 
an interaction between the processing streams that form perceptual boundaries and sur- 
faces. In this sense: “a boundary formation process in the brain is indeed the mechanism 
whereby we perceive geometrical objects such as lines, curves, and textured objects. 
Rather than being defined in terms of such classical units as points and lines, these 
boundaries arise as a coherent pattern of excitatory and inhibitory signals across a mixed 
co-operative-competitive feedback network that is defined by a non-linear dynamical
system describing the cellular interactions from the retina through LGN and the V1
Interblob and V2 Interstripe areas” 7 .

These interactions select the best boundary grouping from among many possible inter- 
pretations of a scene. The winning grouping is represented either by an equilibrium point 
or a synchronous oscillation of the system, depending on how system parameters are 
chosen. FACADE theory suggests how the brain may actually represent these properties 
using non-linear neural networks that do a type of online statistical inference to select 
and complete the statistically most-favoured boundary groupings of a scene while sup- 
pressing noise and inconsistent groupings. 

The boundary completion and the surface filling-in as suggested by Grossberg in his 
article thus represent a very different and innovative approach with respect to the classi- 
cal geometrical view as established in terms of surface differential forms. Let us just re- 
mark that according to a conservative extension of this perspective, from an epistemo- 
logical point of view, simulation models no longer appear as “neutral” or purely specu- 
lative. On the contrary, true cognition appears to be necessarily connected with success- 
ful forms of reading, those forms, in particular, that permit a specific coherent unfolding 
of the deep information content of the Source. Therefore the simulation models, if valid, 
materialise as “creative” channels, i.e., as autonomous functional systems (or
self-organising models), as the very roots of a new possible development of the entire
co-evolutive system represented by mind and its Reality.

The following two chapters are equally centred on an in-depth analysis of Kanizsa’s ex- 
periments, although according to different theoretical and modelistic perspectives. 

The aim of Petitot’s chapter “Functional Architecture of the Visual Cortex and Variational
Models for Kanizsa’s Modal Subjective Contours” is to present a neuro-geometrical model
for generating the shape of Kanizsa’s modal subjective contours. This model is based on the
functional architecture of the primary areas of the visual cortex. The key instrument utilised
by Petitot is the idea of a variational model, as introduced by S. Ullman in 1976. This idea was
refined and extended in 1992 by D. Mumford, who outlined a fundamental model
based on the physical concept of elastica. Mumford essentially aimed to define curves that
simultaneously minimise the length and the integral of the square of the curvature κ, i.e. the
energy $E = \int (\alpha \kappa^2 + \beta)\, ds$, where ds is the element of arc length along the curve.
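
As a rough illustration of the elastica idea, the sketch below discretises an energy of the form "squared curvature plus a length penalty" on a polyline. The function name, the default weights alpha and beta, and the turning-angle estimate of curvature are assumptions made for this example; it is not Mumford's or Petitot's implementation.

```python
import math

def elastica_energy(points, alpha=1.0, beta=1.0):
    """Approximate the integral of (alpha * kappa^2 + beta) ds along a polyline.

    points: list of (x, y) vertices. Hypothetical helper for illustration only.
    """
    energy = 0.0
    # Length term: beta times the length of every segment.
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        energy += beta * math.hypot(x1 - x0, y1 - y0)
    # Curvature term at interior vertices: kappa ~ turning angle / local arc length.
    for (x0, y0), (x1, y1), (x2, y2) in zip(points, points[1:], points[2:]):
        a = math.hypot(x1 - x0, y1 - y0)
        b = math.hypot(x2 - x1, y2 - y1)
        turn = math.atan2(y2 - y1, x2 - x1) - math.atan2(y1 - y0, x1 - x0)
        turn = (turn + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi)
        ds = 0.5 * (a + b)  # arc length attributed to this vertex
        kappa = turn / ds
        energy += alpha * kappa ** 2 * ds
    return energy
```

Under these assumptions a straight path pays only the length penalty, while a path of the same length with a sharp corner pays an additional bending cost, which is why minimisers of such an energy tend to be smooth curves.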

Petitot presents a slightly different variational model, based on the concept of “geodesic
curves” in V1, that is more realistic at the neural level. As is well known, at the
level of the visual cortex “due to their structure, the receptive fields of simple cells detect a
preferential orientation. Simplifying the situation, we can say they detect pairs (a, p) of
a spatial (retinal) position a and a local orientation p at a. They are organised in small
modules called hypercolumns (Hubel and Wiesel) associating retinotopically to each position
a of the retina R a full exemplar P_a of the space of orientations p at a” 8 .

A simplified schema of this structure (with a 1-dimensional base R) is represented by
a fibration of base R and fiber P. In geometry, pairs (a, p) are called contact elements.
Their set V = {(a, p)} needs to be strongly structured to allow the visual cortex to compute
contour integration.

Petitot underlines the importance of the discovery, at the experimental level, of the
cortico-cortical horizontal connections. These connections are precisely what allow the
system to compare orientations in two different hypercolumns corresponding to two different
retinal positions a and b. Besides the horizontal connections we may also identify vertical
connections. According to William Bosking (1997) the system of long-range horizontal
connections can be summarised as preferentially linking neurones with co-oriented,
co-axially aligned receptive fields. Starting from these experimental findings Petitot
can finally show that: “what geometers call the contact structure of the fibration π: R × P → R is
neurologically implemented” 9 .

Thus, he can directly affirm that the integrability condition is a particular version of the
Gestalt principle of “good continuation”. As emphasised by Field, Hayes, and Hess
(1993): “Elements are associated according to joint constraints of position and orientation”
... “The orientation of the elements is locked to the orientation of the path; a
smooth curve passing through the long axis can be drawn between any two successive
elements” 10 . Hence the possibility of identifying a discrete version of the
integrability condition.
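
A minimal sketch of what such a discrete condition might look like is given below: each element carries a position and an orientation, and two successive elements are accepted when both orientations stay close to the direction of the segment joining them. The function names, the (x, y, theta) encoding and the 20-degree tolerance are assumptions chosen for illustration; they are not taken from Petitot or from Field, Hayes and Hess.

```python
import math

def orientation_gap(theta, phi):
    """Smallest angle between two undirected orientations, in [0, pi/2]."""
    d = abs(theta - phi) % math.pi
    return min(d, math.pi - d)

def is_coaligned(elem_a, elem_b, tol=math.radians(20)):
    """True if both elements are roughly aligned with the segment joining them.

    Each element is a hypothetical (x, y, theta) triple: a position and a
    local orientation, i.e. a discrete stand-in for a contact element.
    """
    (xa, ya, ta), (xb, yb, tb) = elem_a, elem_b
    chord = math.atan2(yb - ya, xb - xa)  # direction of the joining segment
    return orientation_gap(ta, chord) <= tol and orientation_gap(tb, chord) <= tol

def is_discrete_contour(elements, tol=math.radians(20)):
    """True if every pair of successive elements passes the alignment test."""
    return all(is_coaligned(a, b, tol) for a, b in zip(elements, elements[1:]))
```

With these assumptions, elements placed along a gently curving arc with tangent orientations pass the test, whereas an element rotated by a right angle with respect to the path breaks the chain.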

According to these results, Petitot can conclude his analysis by saying that “due to the very
strong geometrical structure of the functional architecture (hypercolumns, pinwheels,
horizontal connections), the neural implementation of Kanizsa’s contours is deeply
linked with sophisticated structures belonging to what is called contact geometry and
with variational models analogous to models already well known in physics” 11 .

In other words, a neurally plausible model of Kanizsa-curves at the V1 level reveals itself
as linked first of all to the articulation of specific geodesic curves. In this way it is
possible to progressively identify some of the principal factors of that “perceptual geom- 
etry” that, at the moment, presents itself as the real basis, from a genetic (and genealog- 
ical) point of view, of classical Geometry. 

In their chapter “Gestalt Theory and Computer Vision” A. Desolneux, L. Moisan and
J.M. Morel also start from an in-depth analysis of Kanizsa’s contribution to Gestalt theory.
Their approach, however, essentially concerns a mathematical model characterised in
discrete and not in continuous terms. In their opinion both Gestalt Theory and classical
Information Theory have attempted to answer the following question: how is it possible
to identify global percepts starting from the local, atomic information contained in an
image?

The authors distinguish two kinds of laws at the gestaltic level: 1) the grouping laws
(like vicinity and similarity), whose aim is to build up partial gestalts; 2) the gestalt
principles, whose aim is to synthesise the partial groups obtained by the
elementary grouping laws.

The obtained results show that “... there is a simple computational principle (the
so-called Helmholtz principle), inspired from Kanizsa’s masking by texture, which allows
one to compute any partial gestalt obtainable by a grouping law ... this computational
principle can be applied to a fairly wide series of examples of partial gestalts, namely
alignments, clusters, boundaries, grouping by orientation, size or grey level” 12 .

The authors also show that “... the partial gestalt recursive building up can be led up
to the third level (gestalts built by three successive grouping principles)” 13 . In particular,
they show that all partial gestalts are likely to lead to wrong scene interpretations.
In this way it is possible to affirm that wrong detections are explainable by a conflict
between gestalts. Hence the search for principles capable of resolving some of these
conflicts in an adequate way such as, for instance, the Minimal Description Length
principles.

The central core of the analysis is represented by the definition of several quantitative 
aspects, implicit in Kanizsa’s definition of masking and by the attempt to show that one 
particular kind of masking, Kanizsa’s masking by texture, suggests precise computa- 
tional procedures. Actually, “The pixels are the computational atoms from which gestalt 
grouping procedures can start. Now, if the image is finite, and therefore blurry, how can 
we infer sure events as lines, circles, squares and whatsoever gestalts from discrete da- 
ta? If the image is blurry all of these structures cannot be inferred as completely sure; 
their exact location must remain uncertain. This is crucial: all basic geometric informa- 
tion in the image has a precision” 14 . 

Moreover, the number N_conf of possible configurations for partial gestalts is finite
because the image resolution is bounded. Starting from these simple considerations the
authors apply a general perception principle called the Helmholtz principle: “This principle
yields computational grouping thresholds associated with each gestalt quality. It can be
stated in the following generic way. Assume that atomic objects O_1, O_2, ..., O_n are
present in an image. Assume that k of them, say O_1, ..., O_k, have a common feature, say, same
colour, same orientation, position etc. We are then facing the dilemma: is this common
feature happening by chance or is it significant and enough to group O_1, ..., O_k? In order
to answer this question, we make the following mental experiment: we assume a priori
that the considered quality has been randomly and uniformly distributed on all objects
O_1, ..., O_n. Then we (mentally) assume that the observed position of objects in the image
is a random realisation of this uniform process. We finally ask the question: is the
observed repartition probable or not? If not, this proves a contrario that a grouping process
(a gestalt) is at stake. Helmholtz principle states roughly that in such mental experiments,
the numerical qualities of the objects are assumed to be equally distributed and
independent” 15 . The number of “false alarms” (NFA) of an event measures the “meaningfulness”
of this event: the smaller it is, the more meaningful the event is.
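
To make the mental experiment concrete, the sketch below computes a number of false alarms for the simplest case: k of n objects share a quality that, under the uniform and independent assumption, each object would show with probability p. The binomial tail, the multiplication by a number of tested configurations, and the threshold epsilon = 1 are a common way of writing this kind of test and are assumed here for illustration; the function names and parameters are hypothetical and do not reproduce the authors' code.

```python
from math import comb

def binomial_tail(n, k, p):
    """P[X >= k] for X ~ Binomial(n, p): chance that at least k of n objects
    share the quality if it were assigned independently with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def number_of_false_alarms(n_tests, n, k, p):
    """Expected number of such coincidences over n_tests tested configurations."""
    return n_tests * binomial_tail(n, k, p)

def is_meaningful(n_tests, n, k, p, epsilon=1.0):
    """An event is kept when it would occur by chance fewer than epsilon times."""
    return number_of_false_alarms(n_tests, n, k, p) <= epsilon
```

For instance, with 100 tested configurations and 16 equally likely orientations (p = 1/16), finding 18 of 20 objects with the same orientation yields a very small NFA and is retained as a grouping, whereas finding only 2 of 20 is what chance alone would readily produce.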

This kind of measure is perfectly coherent with an early measure of semantic information
content as introduced by R. Carnap and Y. Bar-Hillel in 1952. Actually, in order to
model holistic perception and the Gestalten in action we have to take into account not 
only the syntactic measures of information but also the semantic ones; only in this way 
shall we be able to give an explanation in computational terms of that peculiar (inten- 
sional) meaningfulness that characterises real perception. 

If we aim to explain in modelistic terms the reality of visual perception we have, how- 
ever, not only to take into account the different intensional aspects of meaningfulness (as 
well as the intentional ones) but also the phenomenal consciousness (actually, at the neu- 
ral level, visual consciousness appears as distributed in space and time, as S. Zeki has re- 
cently pointed out). 

It is precisely to the problem of a possible simulation of phenomenal consciousness that
the chapters by J. K. O’Regan, E. Myin and A. Noe and by A. Di Ferdinando and D. 
Parisi are devoted. 

In what way can we explain how physical processes (neural and computational) can 
produce experience: i.e. phenomenal consciousness? Gestalt is not only a mathematical 
or computational construction: it is something that lives, of which we have direct and ho- 
listic experience. 

As is well known, several scholars have argued that phenomenal consciousness cannot 
be explained in functional or neural terms. According to O’Regan, Myin and Noe,
in the article “Towards an Analytic Phenomenology: the Concepts
of Bodiliness and Grabbiness”, the problem is misplaced: feel is not generated by a
neural mechanism at all; rather, it is exercising what the neural mechanism allows the
organism to do. “An analogy can be made with “life”: life is not something which is
generated by some special organ in biological systems. Life is a capacity that living systems
possess. An organism is alive when it has the potential to do certain things, like repli- 
cate, move, metabolise, etc. But it need not be doing any of them right now, and still it 
is alive... When we look out upon the world, we have the impression of seeing a rich, 
continuously present visual panorama spread out before us. Under the idea that seeing 
involves exercising a skill however, the richness and continuity of this sensation are not 
due to the activation in our brains of a neural representation of the outside world. On the 
contrary, the ongoingness and richness of the sensation derive from the knowledge we 
have of the many different things we can do (but need not do) with our eyes, and the sen- 
sory effects that result from doing them (O'Regan 1992). Having the impression of a 
whole scene before us comes, not from every bit of the scene being present in our minds, 
but from every bit of the scene being immediately available for “handling” by the slight- 
est flick of the eye” 16 . 

According to this point of view, we no longer need to postulate a neural process that 
generates phenomenal consciousness; this kind of consciousness must, on the contrary,
be considered as a skill people exercise. 

Thinking, however, is different from seeing: thinking has no perceptual quality. The 
fundamental difference between mental phenomena that have no feel (like for instance 
knowledge) and mental phenomena that have feel (like sensations) can be explained 
through the introduction of the concepts of bodiliness and grabbiness. 

“Bodiliness is the fact that when you move your body, incoming sensory information 
immediately changes. The slightest twitch of an eye muscle displaces the retinal image 
and produces a large change in the signal coming along the optic nerve. Blinking, mov- 
ing your head or body will also immediately affect the incoming signal” 17 . 

On the other hand, grabbiness is the fact that sensory stimulation can grab our attention 
away from what we were previously doing. Bodiliness and grabbiness are objectively 
measurable quantities that determine the extent to which there is something it is like to 
have a sensation. 

It is the order of what I can do potentially to organise which determines the horizon of 
my seeing, and the horizon inserts itself within this type of self-organisation. When we 
look at a scene, we see, we forecast and we self-organise at the same time. If we want to
build a robot that feels we have to provide the robot with mastery of the laws that gov- 
ern the way its actions affect its sensory input. We have to wire up its sensory receptors 
and we have to give it “access to the fact that it has mastery of the skills associated with 
its sensory exploration” 18 . 

The point of view according to which internal representations are action-based is also 
assumed by A. Di Ferdinando and D. Parisi, in their chapter “Internal Representations of 
Sensory Input Reflect the Motor Output with which Organisms Respond to the Input”. 

“What determines how sensory input is internally represented? The traditional answer 
is that internal representations of sensory input reflect the properties of the input. This 
answer is based on a passive or contemplative view of our knowledge of the world which 
is rooted in the philosophical tradition and, in psychology, appears to be almost manda- 
tory given the fact that, in laboratory experiments, it is much easier for the researcher to 
control and manipulate the sensory input which is presented to the experimental subjects 
than the motor output with which the subjects respond to the input. However, a minori- 
ty view which is gaining increasing support (Gibson, 1986; O’Regan and Noe, in press) 
is that internal representations are instead action-based, that is, that the manner in which 
organisms internally represent the sensory input reflects the properties of the actions 
with which the organisms respond to the sensory input rather than the properties of the 
sensory input” 19 . 

The authors present, in particular, a series of computer simulations using neural
networks that tend to support the action-based view of internal representations. In their
opinion, internal representations are not symbolic or semantic entities; they are patterns
of activation states in the network’s internal units which are caused by input activation
patterns and which in turn cause activation patterns in the network’s output units. The
authors distinguish between micro-actions and macro-actions; the latter are sequences of
micro-actions that allow the organism to reach some goal. Internal representations
encode precisely the properties of macro-actions. The properties of the visual input are retained
in the internal representations only insofar as they are relevant for the action to be
executed in response to the visual input. Hence the necessity to resort to concepts like
adaptation and assimilation. Hence, on the other hand, the importance, with respect to this
frame of reference, of the Adaptive Resonance Theory as outlined by Grossberg, a theory
that, in particular, predicts that all conscious states are resonant states of the brain.

The theme concerning the interaction between man and machine and the possible
construction of a robot able to observe human motion also constitutes the central core of the
chapter by L. Goncalves, E. Di Bernardo and P. Perona “Movemes for Modeling Biological
Motion Perception”. As the authors write, “Perceiving human motion, actions and
activities is as important to machines as it is to humans. People are the most important 
component of a machine’s environment. Endowing machines with biologically-inspired 
senses, such as vision, audition, touch and olfaction appears to be the best way to build 
user-friendly and effective interfaces. Vision systems which can observe human motion 
and, more importantly, understand human actions and activities, with minimal user co- 
operation are an area of particular importance” 20 . 

However: “While it is easy to agree that machines should “look” at people in order to
better interact with them, it is not immediately obvious which measurements should a 
machine perform on a given image sequence, and what information should be extracted 
from the human body. There are two classes of applications: “metric” applications where 
the position of the body has to be reconstructed in detail in space-time (e.g. used as in- 
put for positioning an object in a virtual space), and “semantic” applications where the 
meaning of an action (e.g. “she is slashing through Rembrandt’s painting”) is required. 
The task of the vision scientist/engineer is to define and measure “visual primitives” that 
are potentially useful for a large number of applications. These primitives would be the 
basis for the design of perceptual user interfaces ...substituting mouse motions and 
clicks, keystrokes etc. in existing applications, and perhaps enabling entirely new
applications. Which measurements should we take?” 21

“In looking for a model of human motion one must understand the constraints to such 
motion. First of all: our motions are constrained both by the kinematics and by the dy- 
namics of our body. Our elbows are revolute joints with one degree of freedom (DOF), 
our shoulders are ball joints with three DOF etc. Moreover, our muscles have limited 
force, and our limbs have limited acceleration. Knowledge of the mechanical properties 
of our bodies is helpful in constraining the space of solutions of biological motion per- 
ception. However, we postulate that there is a much more important constraint: the mo- 
tion of our body is governed by our brain. Apart from rare moments, when we are either 
competing in sport or escaping an impending danger, our movements are determined by 
the stereotypical trajectories generated by our brain. . . the dynamics of our body at most 
acts as a low-pass filter” 22 . 

However, generating trajectories is a complex computational task. Neurophysiological
evidence suggests that our nervous system encodes complex motions as discrete
sequences of elementary trajectories. “This suggests a new computational approach to
biological motion perception and to animation. One could define a set of elementary
motions or movemes which would roughly correspond to the ‘elementary units of motion’
used by the brain. One could represent complex motions by concatenating and combin- 
ing appropriate movemes. These movemes would be parameterized by ‘goal’ parameters 
in Cartesian space. This finds analogies in other human behaviours: the “phonemes” are 
the elementary units both in speech perception and production; in handwriting one thinks 
of ‘strokes’ as the elementary units” 23 . 

From a general point of view, in order to encode microactions we need a language, we 
need tools to compress information. Movemes greatly compress motion information. 
They appear to be a natural and rich representation which the brain might employ in per- 
ceiving biological motion. Many questions, however, arise. In what way can we seman- 
tically model and handle the processes of information compression as they articulate at 
the cognitive level? How many movemes might there be? How is it possible to build a 
complete catalogue? What types of syntactical laws govern their generation? What about 
the link between this generation and the unfolding of form constraints? 

Lorenceau’s chapter “Form Constraints in Motion Integration, Segmentation and Se- 
lection” also concerns the realm of form and motion selection. In this article, however, 
particular attention is devoted to the problems concerning the action expressed by the
constraints at the level of form construction. According to Lorenceau, “Perception is a 
process by which living organisms extract regularities from the physical fluxes of vary- 
ing physical characteristics in the external world in order to construct the stable repre- 
sentations that are needed for recognition, memory formation and the organisation of ac- 
tion. The exact nature of the process is still not well understood as the type of regulari- 
ties that are indeed used by sensory systems can be manifold. However, perception is not 
a process by which living organisms would reproduce the physical fluxes such as to build 
an internal representation identical to its physical counterpart. One issue then is to un- 
derstand the kind of physical regularities that are relevant for perceiving and recognising 
events in the outside world” 24 . 

We may, following the gestaltist approach, develop experimental paradigms to define 
and isolate the general rules underlying perceptual organisation. 

“In vision, figure/ground segregation and perceptual grouping of individual tokens into a 
“whole” appeared to strongly rely on several rules such as good continuity, proximity,
closure, symmetry, common fate, synchronicity etc. Most importantly, these principles define
spatial and temporal relationships between “tokens”, whatever the exact nature of these
tokens: dots, segments, colour, contours, motion, etc. Implicitly, the general model
underlying the Gestaltist approach is a geometrical one, stressing the spatial relations between
parts rather than concerned with the intrinsic processing of the parts themselves. However,
the attempt of the gestalt school to offer a plausible neuronal perspective that could explain
their observations on perceptual organisation failed, as the Gestaltists were thinking in terms
of an isomorphism between external and internal geometrical rules whereby spatial
relationships between neurones would mimic the geometry of the stimulus. Electrophysiological
and anatomical studies did not reveal such an isomorphism” 25 . In his paper
Lorenceau takes into account recent neurophysiological findings that suggest how geo- 
metrical principles may be implemented in the brain, also discussing hypotheses about the 
physiological mechanisms that may underlie perceptual grouping. Lorenceau, in particu- 
lar, points out that “In support of a functional link between neurones through horizontal 
connections in primary visual cortex, a number of recent psychophysical studies uncovered 
strong contrast dependent centre-surround interactions, either facilitatory or suppressive, 
that occur when one or several oriented test stimuli are analysed in the presence of sur- 
rounding oriented stimuli. For instance, contrast sensitivity is improved by similar flankers, 
collinear and aligned with the test stimulus. Changing the relative distance, orientation, 
spatial frequency or contrast of the flankers modulates the change in sensitivity, allowing 
the analysis of the architecture of these spatial interactions” 26 . 

From a general point of view, he underlines the fact that feedback and long-range connections
within a single area provide a possible physiological substratum for computing
some of the geometrical properties of the incoming image. With respect to the interplay
between form and motion, the results presented by the author demonstrate the
critical role played by geometrical information in global motion computation.

“Local singularities such as vertices, junctions or line-ends appear to exert a strong
control on the balance between integration and segmentation as salient contour terminators
appear to be used to parse the image into parts. Global geometrical image properties
also appear to provide strong constraints on the integration process, as integrating mov- 
ing contours into a global motion is a simple task for some configurations (diamonds) 
while it is difficult for others (crosses and chevrons)” 27 . 

Actually: “Closure of the diamond by amodal completion (Kanizsa, 1979), together
with the filling-in of its interior this may engender, would serve effectively the segregation
of the diamond from its background. Consequently, judging the diamond’s direction
of rotation would be much easier than for open shapes which generate poorer responses
at the level of object representation.... The effects of boundary completion, filling-in
and figure/ground segregation, can all be considered broadly under the rubric of
form processing. Our data suggest that the role of form information is to regulate 
whether motion integration should go ahead or not” 28 . 

Lastly, Lorenceau points out that numerous studies now support the idea that geomet- 
rical relationships between visual elements or “tokens” play a fundamental role in the 
perceptual organisation of form and motion. The available anatomical and physiological 
evidence suggests that the neural circuitry described in the primary visual cortex pos- 
sesses some of the properties needed to process the geometric features of the retinal in- 
put. With respect to this framework it is in any case necessary to underline that the in- 
teractions between form and motion are bi-directional. Neural circuitry appears able to 
individuate invariants and to connect these invariants according to different stages, phas- 
es and rules. 

When we speak in terms of tokens, syntactic and informational rules, invariants (and 
attractors) and so on, we actually try to describe (and creatively unfold) an information- 
al code utilised by our mind in order to realize an adequate recovery-reading of depth in- 
formation. Actually, we continuously invent and construct new models and visual lan- 
guages in order to perceive, i.e. to interpret according to the truth in a co-evolutive con- 
text. 

J. Ninio concludes his intriguing and “passionate” chapter “Scintillations, Extinctions 
and Other New Visual Effects” by means of these simple words: “I find more and more 
satisfaction, as Kanizsa did, in elaborating striking images. Whereas, in his case, the im- 
ages must have been the outcome of a completely rational line of thinking, in my case 
they came by surprise. They were - at least for Fig. 7 and 8, the unexpected reward of a 
very systematic work of variations in the geometry of the stimuli” 29 . 

In these words lies the soul of the chapter. Its “intelligence” is first of all in the images 
presented by the author: new effects, new possible “openings” of our mind, new ways of 
seeing. Like Kanizsa, Ninio too possesses the art of “constructing with maestria” and in 
the article this is predicated on the account of a lifetime. By means of his images-effects 
Ninio offers new “cues” for the “invention” of new languages, of new models of simu- 
lation and, at the same time, for the individuation of new “ways to be given”, of new in- 
tensional settings. 

M. Olivetti, R. Di Matteo, C. del Gratta, A. de Nicola, A. Ferretti and G.L. Romani in 
their chapter “Commonalities Between Visual Imagery and Imagery in Other Modalities: 
an Investigation by Means of fMRI” remark, first of all, that: “The attempt to shadow the
differences between seeing and thinking by stressing their similarities is not an epistemologically
correct operation, because by using principles related to another domain
(like thinking) to explain vision may induce a pre-packed explication. Instead, stressing 
differences may lead to the discovery of new rules governing only one of the two 
processes under investigation (Kanizsa, 1991). We report this provoking statement of 
Kanizsa, while approaching our research on mental imagery for two main reasons: 1) the
main part of the psychological research on imagery is devoted to visual imagery, im- 
plicitly assuming that imagery derived from other sensory modalities will present char- 
acteristics that are similar to those of visual imagery; 2) a lot of studies on visual imagery 
are devoted to assess whether primary perceptual circuits are implied also in imagery 
and, therefore, to assess how much seeing is similar to imaging. In this study we ac- 
cepted Kanizsa’s suggestion by trying to assess differences between visual and other- 
senses imagery in order to detect their peculiarities and the grade of their overlap” 30 . 

As the authors underline, studies examining the relationship between imagery and
processes related to modalities other than vision are very rare. The chapter is devoted
to studying how far imagery in the various sensory modalities is tied to the processing
of visual features. It tries to identify, first of all, the common substrate of visual
images and images generated according to other sensory modalities. “It consists of a 
fMRI block design while participants were requested to generate mental images cued 
by short sentences describing different perceptual objects (shapes, sounds, odours,
flavours, self-perceived movements and internal sensations). Imagery cues were pre- 
sented visually and were contrasted with sentences describing abstract concepts, since 
differences in activation during visual imagery and abstract thoughts were often as- 
sessed in literature” 31 . 

From this study, it is possible to derive three key findings: “First, common brain areas 
were found to be active in both visual imagery and imagery based on other sensory 
modalities. These common areas are supposed to reflect either the verbal retrieval of 
long-term representations or the segregation of long-term representations into highly 
interactive modality specific regions. Second, each imagery modality activates also dis- 
tinct brain areas, suggesting that high-level cognitive processes imply modality-specif- 
ic operations. This result is coherent with the domain-specific hypothesis proposed for 
the functioning of the fronto-parietal associative stream (Rushworth & Owen, 1998; 
Miller, 2000). Third, primary areas were never found to be active, suggesting that dif- 
ferent, though interactive, neural circuits underlie low-level and high-level processes. 
Although this claim is only indicative, as in this study no direct comparisons were made 
between imagery and perceptual/motor processes, it outlines the lack of primary cortex 
activation for imagery in those modalities that were not accompanied by any corre- 
sponding sensory stimulation due to either the visual presentation of the stimuli or to 
the noisy apparatus” 32 . 

The second part of the volume is devoted to a thorough analysis of a number of conceptual
tools that have proved particularly useful in interpreting cognitive and
mental phenomena. Microgenesis, Synergetics, Self-Organisation Theory, Semantics,
Evolutionary Theory etc. are extensively utilised in the different chapters in order to
clarify the mysterious relationships existing between emergence of meaning, self-organisation
processes, emotion, differentiation and unfolding processes, symbolic dynamics
etc. Actually, in order to outline more sophisticated models of cognitive activities (and 
in particular of that inextricable plot constituted by “seeing and thinking”) we have to 
examine and individuate specific theoretical methods capable, for instance, of taking in- 
to account also the intentional and semantic aspects of that specific, mental and biolog- 
ical process linking together growth with symbolic differentiation which characterises 
human cognition. 

In his chapter “Microgenesis, Immediate Experience and Visual Processes in Reading”, 
V. Rosenthal clearly illustrates and discusses the concept of microgenesis as correlated 
to specific processes of unfolding and differentiation. 

“Microgenetic development concerns the psychogenetic dynamics of a process that can 
take from a few seconds (as in the case of perception and speech) up to several hours or 
even weeks (as in the case of reading, problem solving or skill acquisition). It is a living 
process that dynamically creates a structured coupling between a living being and its en- 
vironment and sustains a knowledge relationship between that being and its world of life 
(Umwelt). This knowledge relationship is protensively embodied in a readiness for fur- 
ther action, and thereby has practical meaning and value. Microgenetic development is 
thus an essential form of cognitive process: it is a dynamic process that brings about 
readiness for action. Microgenesis takes place in relation to a thematic field which, how- 
ever unstable and poorly differentiated it might be, is always given from the outset. To 
this field, it brings stabilised, differentiated structure and thematic focalization, thereby 
conferring value and meaning to it. Figure/ground organisations are an illustration of a 
typical microgenetic development. Yet, one should bear in mind that however irresistible 
an organisation might appear, it is never predetermined but admits of alternative solu- 
tions, that a ‘figure’ embodies a focal theme, and that a ‘ground’ is never phenomeno- 
logically or semantically empty” 33 . 

At the level of microgenesis, form, meaning and value cannot be considered as separate
entities; on the contrary, perception is directly meaning- and value-laden, with actual
meaning developing along the global-to-local dynamics of microgenesis. Meaning, in
this sense, is not the end product of perception but rather part and parcel of the percep- 
tual process. Actually: “...theories which separate sensory, semantic, motivational and 
emotional processes, and view perception as a construction of abstract forms out of 
meaningless features (only to discover later their identity and meaning), face in this re- 
spect insurmountable paradoxes. If semantics post-dates morphology, then it cannot af- 
fect form reconstruction, and if semantics is concomitant with form reconstruction, how 
can it influence morphological processing prior to ‘knowing’ what the latter is about? Fi- 
nally, since morphological and semantic processes are viewed as incommensurable, how 
can they be brought to cooperate together without recourse to yet another, higher-order 
process? Invoking such a process would either amount to conjuring up a sentient device 
of the homunculus variety or would stand in contradiction to the very postulate of the 
distinctness and independence of meaning and form” 34 . 

Perception, according to Rosenthal, necessarily refers to the assumption of the consistency
and meaningfulness of the world in which we live. It anticipates meaningful structures
and categorises them on a global dynamic basis.

“The segmentation of the perceptual field into individual objects is thus the result of 
perceptual differentiation, and not the objective state of affairs that perception would 
merely seek to detect and acknowledge. In this sense, microgenesis is the process that 
breaks up the holistic fabric of reality into variably differentiated yet meaningful objects, 
beings and relations” 35 . 

Thus, the proposition that perception is based on reconstruction from elementary components
raises more problems than it may be expected to solve. According to Rosenthal,
who quotes Husserl, the “now” has retentions and protentions, i.e. there is a continuous
and dynamic structure to experience.

Various phenomena of perceptual completion, whether figures, surfaces or regions, 
provide an interesting illustration of microgenetic dynamics at work in perception. 
“Consider the famous example of the Kanizsa square where a collinear arrangement of 
edges of four white ‘pacmen’ (inducers) on a black background gives rise to the percep- 
tion of a black square whose area appears slightly darker than the background. In addi- 
tion, the surface of the square appears to the observer to be in front of four disks that it 
partly occludes. Since the square is perceived in spite of the absence of corresponding 
luminance changes (i.e. forming complete boundaries), and thus does not reflect any re- 
al distal object, it can only be created by the visual system which purportedly completes, 
closes, and fills in the surfaces between ‘fragments’, so as to make the resulting ‘sub- 
jective’ region emerge as figura standing in the ground. Yet, as Kanizsa (1976; 1979) apt- 
ly showed, this and other examples of so-called subjective contours demonstrate the ba- 
sic validity of Gestalt principles of field organisation, in particular of its figure/ground 
structure and of Prägnanz, whereby incomplete fragments are, upon completion, transformed
into simpler, stable and regular figures. Although this phenomenon is often described
in terms of contour completion, it clearly demonstrates a figural effect, whereby
the visual system imposes a figural organisation of the field (and hence figure comple- 
tion), and where the contour results from perceiving a surface, not the other way around, 
again as Kanizsa suggested. Moreover, these subjective figures illustrate the categorial 
and anticipatory character of microgenetic development, such that the perceptual system 
anticipates and actively seeks meaningful structures and immediately categorises them 
on a global dynamic basis” 36 . The crucial role of meaningfulness is demonstrated by the 
fact that no subjective figures arise in perception when the spatial arrangement of in- 
ducers fails to approximate a ‘sensible form’ or when the inducers are themselves mean- 
ingful forms. Actually, in Husserlian terms, meaning “shapes” the forms creatively. In 
order, however, to understand how this shaping takes place we need more information 
about the genealogical aspects of this mysterious process. 

Lastly, Rosenthal presents a specific illustration of certain principles of microgenetic
theory in the field of reading: a further step towards outlining a genetic phenomenological
science of embodied cognition.

According to Y.M. Visetti, ordinary perception does not constitute a foundation for linguistics
but rather an essential correlate and a particular illustration of the construction of
meaning. As he writes in his chapter “Language, Space and the Theory of Semantic
Forms”, perception “... has to be considered as instantiating a general structure of cognition,
and not only as resorting to a purely sensorial and peripheral organisation. As a
slogan, we could say that ‘to perceive is from a single move to act and to express’. Per- 
ception already gives access to, and sketches, a meaning. It implies not only the presence 
of things, but a perspective of the subject, and a suggestion of acting. Perception in space 
is not grasping pure configurations or shapes, nor only a basis for other, subsequent ‘as- 
sociative’ or ‘metaphorical’ interpretations: it is from the outset a dynamic encounter of 
‘figures’ with no necessary dissociation between forms and values, apprehended in the 
course of actions, and deeply qualified by a specific mode of access or attitude. It is this 
notion of a qualified relation (which is a way of ‘accessing’, of ‘giving’ of ‘apprehend- 
ing’....) that we want to transpose into semantics, in order to view it as a kind of percep- 
tion and/or construction of forms. At this level, any distinction between abstract or con- 
crete, or between interior or exterior perception, is irrelevant. In the same way as there 
is more than topology or geometry in our multiple relations to ambient space, we can say
that ‘figures’ are objective counterparts, phenomenological manifestations of the rela- 
tions we have with them” 37 . 

In such a framework “schemes” are not formal types, but “potentials” to be actualised.
In the same way, forms are to be considered as the result of dynamical stabilisation
processes, i.e. as units in an ongoing continuous flow. As Visetti remarks, recent advances
in the theory of complex systems allow us to conceive a unified setting for language activity
considered as a construction of forms in a semantic field. Hence the revisitation of
a Humboldtian conception of language, which considers it not as a finished product but
as a self-organized activity.

One of the major aims of Wimmer’s chapter “Emotion-Cognition Interaction and Language”
is to show that language and the underlying levels of emotion and cognition that it requires
appear as interacting phenomena. None of these three functions can be isolated.
There is neither a pure emotion nor a pure cognition nor any kind of ideal language
without any relationship to the underlying levels of being. The roots of Wimmer’s considerations
may be found in Evolutionary Epistemology and in Genetic Epistemology.
According to a well-known statement by K. Lorenz, life itself can be considered
as a “knowledge gaining process”. In Piaget’s Genetic Epistemology, on the other hand,
we can find a similar type of naturalistic account: life itself is considered as a self-regulatory
process.

In accordance with this theoretical setting, Wimmer, paraphrasing I. Kant, formulates 
a basic hypothesis: “Affects without cognitions are blind and cognitions without affects 
are empty”. “What does this mean and what contributes this hypothesis to language related
issues? The core of the argument is the assumption that from an evolutionary-phylogenetical
viewpoint, the distinction between affect and cognition seems to be artificially
drawn leading to wrong conclusions. The sharp distinction between affect and cognition
has deep roots in our cultural heritage, leading back to ancient Greek philosophy.
(comp. Averill 1996; Gardiner/Metcalf/Beebe-Center 1937) Beside these historical
roots also recent neuroanatomical and neurophysiological research indicates a distinction
between brain areas and mechanisms responsible for affective and cognitive
processes. (Panksepp 1998; MacLean 1990) In contrast to these considerations an evo- 
lutionary approach leads to the assumption that there exists one common root of emo- 
tion and cognition. A root which in its early and very basic form is very close to ho- 
moeostatic - regulatory mechanisms. The root metaphor is very helpful in proposing a 
picture of one common root, which branches off in different branches always remaining 
closely related to the basic root. Even (in phylogenetical dimensions) the very young
ability of language usage can be traced back to this basis root” 38 . 

If we consider cognition (and consequently perception) as the result of a coupled
process, intrinsically related each time, from a phenomenological point of view, to the
constitution of an inner horizon (or of a multiplicity of horizons), emotion clearly appears
radically hinged on specific cognitive schemes.

For M. Stadler and P. Kruse too, in the chapter “Appearance of Structure and Emergence
of Meaning in the Visual System”, the brain is a self-organising system. Cognitive
processes are actually based on the elementary neural dynamics of the brain. In this
sense the synergetic approach can be concretised in three empirical hypotheses: “-It is
possible to demonstrate non-linear phase transitions in cognition. For example continu- 
ous changes in stimulus conditions are able to trigger sudden reorganisations in percep- 
tion. Perceptual organisation cannot be reduced to the properties of the stimulus. 

-Stable order in cognition is the result of underlying neuronal dynamics and therefore 
critically bound to instability. For example any percept is the result of a process of dy- 
namic order formation. Because of the underlying dynamics perception is in principle 
multistable. Each stable percept can be destabilised and each instable percept can be sta- 
bilised. 

-Meaning is an order parameter of the elementary neuronal dynamics. For example in 
the instability of ambiguous displays the basic order formation of perception can be in- 
fluenced by subtle suggestive cues” 39 . 

The fact that meaning may influence the structure of brain processes is predicted by the 
synergetic model of mind-brain interaction. In order to establish a good model for 
macrodynamic brain-mind processes we need to define specific order parameters which 
emerge out of the elementary dynamics and which transform the basic instability into co- 
herent stable patterns. 

When we consider forms as the result of dynamical stabilisation processes also utilis- 
ing the more recent advances in the theory of complex systems, we have the concrete 
possibility to conceive some new methodological tools in order to investigate the prob- 
lem of form construction, i.e. to outline a theory both phenomenological and physical 
relative to the actual emergence of meaning: and we have just seen that meaning 
“shapes” the forms creatively. However, in order to understand in what way this “shap- 
ing” takes place we need, as we have just said, more information about the genealogical 
aspects of the process. Moreover, we also need a semantic and dynamic handling of the 
processes of information compression as they express themselves at the biological level. 
In particular we need more and more adequate measures of meaningful complexity, ca- 
pable, for instance, of taking into account also the dynamic and interactive aspects of 
depth information. In short, we need new models of cognition: functional and co-evolutive
models, not static or interpretative ones. At the level of models of this kind, emergence
(in a co-evolutive landscape) and truth (in an intensional setting) will, in many respects,
necessarily coincide.

The chapter by A. Carsetti “The Embodied Meaning: Self-Organisation and Symbolic
Dynamics in Visual Cognition” seeks to present some aspects of contemporary attempts 
to “reconstruct” a genealogy of vision through a precise revisitation of some of Husserl’s 
original intuitions. Also in this case, this revisitation is operated in the general frame- 
work of the contemporary theoretical development of Self-organisation Theory. 

In Carsetti’s opinion “...vision is the end result of a construction realised in the condi- 
tions of experience. It is “direct” and organic in nature because the product of neither 
simple mental associations nor reversible reasoning, but, primarily, the “harmonic” and 
targeted articulation of specific attractors at different embedded levels. 

The resulting texture is experienced at the conscious level by means of self-reflection, 
we really sense that it cannot be reduced to anything else, but is primary and self-con- 
stituting. We see visual objects; they have no independent existence in themselves but 
cannot be broken down into elementary data. Grasping the information at the visual lev- 
el means managing to hear, as it were, inner speech. It means reconstructing in the neg- 
ative, in an inner generative language, through progressive assimilation, selection and re- 
al metamorphosis (albeit partially and roughly) the articulation of the complex “ge- 
nealogical” apparatus which works at the deep semantic level and moulds and subtends 
the presentation of the functional patterns at the level of the optical sieve. Vision as emer- 
gence aims first of all to grasp the paths and the modalities that determine the selective 
action, the modalities specifically relative to the revelation of the afore-mentioned appa- 
ratus at the surface level according to different and successive phases of generality.... 
The afore-mentioned paths and modalities thus manage to “speak” through my own fi- 
bres. It is exactly through a similar self-organising process, characterised by the presence 
of a double-selection mechanism, that the brain can partially manage to perceive depth 
information in an objective way. The extent to which the simulation model succeeds, al- 
beit partially, in encapsulating the secret cipher of this articulation through a specific 
chain of programs determines the irruption of new creativity as well as the model’s ability
to see with the eyes of the mind.

To assimilate and see the system must first “think” internally the secret structures of the 
possible, and then posit itself as a channel (through the precise indication of forms of po- 
tential coagulum) for the process of opening and revelation of depth information. This 
process then works itself gradually into the system’s fibres, via possible selection, ac- 
cording to the coagulum possibilities offered successively by the system itself. 

The revelation and channelling procedures thus emerge as an essential and integrant 
part of a larger and coupled process of self-organisation. In connection with this process 
we can ascertain the successive edification of an I-subject conceived as a progressively 
wrought work of abstraction, unification, and emergence. The fixed points which man- 
age to articulate themselves within this channel, at the level of the trajectories of neural 
dynamics, represent the real bases on which the “I” can reflect and progressively constitute
itself. The I-subject can thus perceive to the extent in which the single visual perceptions
are the end result of a coupled process which, through selection, finally leads
the original Source to articulate and present itself, by means of cancellations and “irrup- 
tions”, within (and through) the architectures of reflection, imagination and vision. 
These perceptions are (partially) veridical, direct, and irreducible. They exist not in 
themselves, but, on the contrary, for the “I”, but simultaneously constitute the primary 
departure-point for every successive form of reasoning perpetrated by the observer. As 
an observer I shall thus witness Nature/ Naturata since I have connected functional forms 
in accordance with a successful and coherent “score”. 

It is precisely through a coupled process of self-organisation of the kind that it will final- 
ly be possible to manage to define specific procedures of reconstruction and representation 
within the system, whereby the system will be able to identify a given object within its con- 
text, together with its Sinn. The system will thus be able to perceive the visual object as im- 
mersed within its surroundings, as a self-sustaining reality, and, at the same time, feel it liv- 
ing and acting objectively within its own fibres. In this way it will be possible for the brain 
to perceive depth information according to the truth (albeit partially)” 40 . 

At the end of this short and incomplete presentation of the main guidelines of the book, 
let us now make just a few final remarks. 

According to the suggestions presented by the authors in the different chapters the 
world perceived at the visual level appears as constituted not by objects or static forms, 
but by processes appearing imbued with meaning. As Kanizsa stated, at the visual level 
the line per se does not exist: only the line which enters, goes behind, divides, etc.: a line
evolving according to a precise holistic context, in comparison with which function and 
meaning are indissolubly interlinked. The static line is in actual fact the result of a dy- 
namic compensation of forces. Just as the meaning of words is connected with a universe 
of highly-dynamic functions and functional processes which operate syntheses, cancel- 
lations, integrations, etc. (a universe which can only be described in terms of symbolic 
dynamics), in the same way, at the level of vision, we must continuously unravel and 
construct schemata, must assimilate and make ourselves available for selection by the
co-ordinated information penetrating from external reality. We have, at the same time, to 
inventively explore the secret architecture of non-standard grammars governing visual 
perception. Lastly, we must interrelate all this with the internal selection mechanisms 
through a precise “journey” into the regions of intensionality. 

In accordance with these intuitions, we may tentatively consider, from the more gener- 
al point of view of contemporary Self-organisation theory, the network of meaningful 
programs living at the level of neural systems as a complex one which forms, articulates, 
and develops, functionally, within a “coupled universe” characterised by the existence of 
a double selection. This network gradually posits itself as the basis for the emergence of 
meaning and the simultaneous, if indirect, surfacing of an “acting I”: as the basic instru- 
ment, in other words, for the perception of real and meaningful processes, of “objects” 
possessing meaning, aims, intentions, etc.: above all, of objects possessing an inner plan 
and linked to the progressive expression of a specific cognitive action. 

The brain considered as an “intelligent” network which develops with its meaning articulates
as a growing neuronal network through which continuous restructuring processes
are effected at a holistic level, thus constituting the indispensable basis of visual cognitive
activity. The process is first of all, as stated above, one of canalisation and of revelation
(according in primis to specific reflection procedures) of precise informational
(and generative) fluxes-principles. It will necessarily articulate through schemata which 
will stabilise within circuits and flux determinations. In this sense the brain progressive- 
ly constitutes itself as a self-organising measuring device in the world and of the world. 
When, therefore, the model-network posits itself as an ‘I-representation’ (when the arch of
simulation “reaches completion”), and views the world-Nature before it, it sees the world 
in consonance with the functional forms on which its realisation was based, i.e. accord- 
ing to the architecture proper to the circuits and the patterns of meaning which managed 
to become established. The result is Nature written in mathematical formulae: Nature read 
and seen iuxta propria principia as a great book (library) of natural forms by means of 
symbolic characters, grammatical patterns and specific mathematical modules. 

From a general point of view, at the level of the articulation of visual cognition, we are 
actually faced with the existence of precise forms of co-evolution. With respect to this 
dynamic context, we can recognise, at the level of the aforementioned process of inven- 
tive exploration, not only the presence of patterns of self-reflection but also the progres- 
sive unfolding of specific fusion and integration functions. We can also find that the Sinn 
that embodies in specific and articulated rational intuitions guides and shapes the paths 
of the exploration selectively. It appears to determine, in particular, by means of the def- 
inition of precise constraints, the choice of a number of privileged patterns of function- 
al dependencies with respect to the entire relational growth. As a result, we can inspect 
a precise spreading of the development dimensions, a selective cancellation of relations 
and the rising of specific differentiation processes. 

We are faced with a new theoretical landscape characterised by the successive unfold- 
ing (in a co-evolutive context) of specific mental processes submitted to the action of 
well-defined selective pressures and to a continuous “irruption” of depth information. 
This irruption, however, reveals itself as canalised by means of the action of precise con- 
straints that represent the end product of the successive transformation of the original 
gestalten. Actually, the gestalten can “shape” the perceptual space according to a visual 
order only insofar as they manage to act (on the basis of the transformation undergone) 
as constraints concerning the generative (and selective) processes at work. Selection is 
creative because it determines ever-new linguistic functions, ever-new processing units 
which support the effective articulation of new coherence patterns. The development of 
this creativity, however, would not be possible without the above-mentioned
transformation and the inner guide offered by the successful compositio of the different con-
straints in action. On the other hand, the very irruption could not take place if we were 
not able to explore the non-standard realm in the right way, i.e. if we were not capable 
of outlining adequate non-standard models and continuously comparing, in an “intelli- 
gent” way, our actual native competence with the simulation recipes at work. 

We can perceive the objective existence of specific (self-organising) forms only inso- 
far as we transform ourselves into a sort of arch or gridiron for the articulation, at the
second-order or higher-order level and in accordance with specific selective procedures,
of a series of conceptual plots and fusions, a series that determines a radical metamor- 
phosis of our intellectual capacities. It is exactly by means of the actual reflection on the 
newly generated abstract constructions as well as of the mirror represented by the suc-
cessful invention of adequate simulation models that I shall finally be able to inspect the 
realisation of my autonomy, the progressive embodiment of my mental activities in a 
“new” coherent and self-sustained system. 

Meaning can selectively express itself only through: a) the nested realisation, at the ab-
stract level, of specific “fusion” processes, b) the determination of specific schemes of 
coherence able to support this kind of realisation, c) a more and more co-operative and 
unified articulation at the deep level of the primary informational fluxes. It shapes the 
forms in accordance with precise stability factors, symmetry choices, coherent contrac- 
tions and ramified completions. We can inspect (and “feel”) this kind of embodiment, at 
the level of “categorial intuition”, insofar as we successfully manage to reconstruct, 
identify and connect, at the generative level, the attractors of this particular dynamic 
process. It is exactly by means of the successive identification and guided compositio of 
these varying attractors that we can manage to imprison the thread of meaning and iden- 
tify the coherent texture of the constraints concerning the architecture of visual thoughts. 
In this way we shall finally be able to obtain a first self-representation of our mental ac- 
tivities, thus realising a form of effective autonomy. A representation that exactly con- 
cerns the “narration” relative to the progressive opening of the eyes of our mind and the 
correlated constitution of the Cogito and its rules.

INTRODUCTION: SEEING AND THINKING BASED ON BOUNDARIES AND SURFACES

Helmholtz proposed that seeing and thinking are intimately related. He articulated this 
claim as part of his doctrine of unconscious inference. Kanizsa, in contrast, proposed that 
seeing and thinking often function according to different rules. These alternative intel- 
lectual positions remain as an enduring controversy in visual science. Why is the rela- 
tionship between seeing and thinking so hard to disentangle? 

Recent neural models of visual perception have clarified how seeing and thinking op- 
erate at different levels of the brain and use distinct specialized circuits. But these 
processes also interact intimately via feedback and use similar laminar cortical designs
that are specialized for their distinct functions. In addition, this feedback has been pre- 
dicted to be an essential component in giving rise to conscious visual percepts, and re- 
cent data have provided support for this prediction. Thus, although seeing and thinking 
are carried out by different parts of the brain, they also often interact intimately via feed- 
forward and feedback interactions to give rise to conscious visual percepts. 

The distinction between seeing and thinking is sometimes cast as the distinction be-
tween seeing and knowing, or between seeing and recognizing. The fact that these 
processes are not the same can be understood by considering a suitable juxtaposition of 
boundary and surface percepts. The FACADE (Form-And-Color-And-DEpth) theory of 
how the brain gives rise to visual percepts has clarified the sense in which these bound- 
ary and surface percepts compute complementary properties, and along the way how and 
why properties of seeing and recognizing are different (Grossberg, 1984, 1994, 1997; 
Grossberg and Kelly, 1999; Grossberg and McLoughlin, 1997; Grossberg and Mingolla, 
1985a, 1985b; Grossberg and Pessoa, 1998; Grossberg and Todorovic, 1988; Kelly and 
Grossberg, 2000; McLoughlin and Grossberg, 1998). FACADE theory proposes that per- 
ceptual boundaries are formed in the LGN — interblob — interstripe — V4 stream, where- 
as perceptual surfaces are formed in the LGN — blob — thin stripe — V4 stream (Gross- 
berg, 1994); see Figure 1. Many experiments have supported this prediction (Elder and 
Zucker, 1998; Lamme et al., 1999; Rogers-Ramachandran and Ramachandran, 1998).

Figure 1. Schematic diagram of processing streams in visual cortex in the macaque monkey brain. Icons indicate
the response selectivities of cells at each processing stage: rainbow = wavelength selectivity, angle symbol =
orientation selectivity, spectacles = binocular selectivity, and right-pointing arrow = selectivity to motion in
a prescribed direction. [Adapted with permission from DeYoe and van Essen (1988).] 



Figure 2a illustrates three pairs of complementary properties using the illusory contour 
percept of a Kanizsa square (Kanizsa, 1974). Such a percept immediately raises the 
question of why our brains construct a square where there is none in the image. There 
are several functional reasons why our brains have developed strategies to construct 
complete representations of boundaries and surfaces on the basis of incomplete infor- 
mation. One reason is that there is a blind spot in our retinas; namely, a region where no 
light-sensitive photoreceptors exist. This region is blind because of the way in which the 
pathways from retinal photoreceptors are collected together to form the optic nerve that 
carries them from the retina to the LGN in Figure 1. We are not usually aware of this 
blind spot because our brains complete boundary and surface information across it. The 
actively completed parts of these percepts are visual illusions, because they are not de- 
rived directly from visual signals on our retinas. Thus many of the percepts that we
believe to be “real” are visual illusions whose boundary and surface representations just
happen to look real. I suggest that what we call a visual illusion is just an unfamiliar
combination of boundary and surface information. This hypothesis is illustrated by the 
percepts generated in our brains from the images in Figure 2. 







Figure 2. Visual boundary and surface interactions: (a) A Kanizsa square. (b) A reverse-contrast Kanizsa square.
(c) An object boundary can form around the gray disk even though its contrast reverses relative to the background
along its perimeter. (d) An invisible, or amodal, vertical boundary. (e) An example of neon color spreading.



In response to the image in Figure 2a, boundaries form inwardly between cooperating 
pairs of colinear edges of the four pac man, or pie-shaped, inducers. Four such contours
form the boundary of the perceived Kanizsa square. (If boundaries formed outwardly 
from a single inducer, then any speck of dirt in an image could crowd all our percepts 
with an outwardly growing web of boundaries.) These boundaries are oriented to form 
in a collinear fashion between (almost) colinear and (almost) like-oriented inducers. The 
square boundary in Figure 2a can be both seen and recognized because of the enhanced
illusory brightness of the Kanizsa square. By contrast, the square boundary in Figure 2b
can be recognized even though it is not visible; that is, there is no brightness or color dif- 
ference on either side of the boundary. Figure 2b shows that some boundaries can be rec- 
ognized even though they are invisible, and thus that seeing and recognizing cannot be 
the same process. FACADE theory predicts that “all boundaries are invisible” within the 
boundary stream, which is proposed to occur in the interblob cortical processing stream 
(Figure 1). This prediction has not yet been directly tested neurophysiologically, al- 
though several studies have shown that the distinctness of a perceptual grouping, such 
as an illusory contour, can be dissociated from the visible stimulus contrast with which 
it is associated (Hess et al., 1998; Petry and Meyer, 1987).

Why is the square boundary in Figure 2b invisible? This property can be traced to the 
fact that its vertical boundaries form between black and white inducers that possess op- 
posite contrast polarity with respect to the gray background. The same is true of the 
boundary around the gray disk in Figure 2c, which is another figure that was originally 
proposed by Kanizsa, but to make a different point. In this figure, the gray disk lies in 
front of a textured background whose contrasts with respect to the disk reverse across 
space. In order to build a boundary around the entire disk, despite these contrast rever- 
sals, the boundary system pools signals from opposite contrast polarities at each posi- 
tion. This pooling process renders the boundary-system output insensitive to contrast po- 
larity. The boundary system therefore loses its ability to represent visible colors or 
brightnesses, as its output cannot signal the difference between dark and light. It is in this 
sense that “all boundaries are invisible”. Figure 2d illustrates another invisible boundary 
that can be consciously recognized. Figure 2 hereby illustrates that seeing and recogniz- 
ing must use different processes, since they can be combined or dissociated, in response 
to relatively small changes in the contrasts of an image, holding its geometrical rela- 
tionships constant, or indeed by changing the geometrical relationships of an image, 
while holding its contrasts constant. 
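
The pooling of opposite contrast polarities described above can be illustrated with a minimal one-dimensional sketch. It is not the FACADE model itself: the function names, the difference-based edge detector, and the sample luminance values are assumptions introduced only for illustration. Two polarity-selective "simple cell" channels are half-wave rectified and then summed into a "complex cell" response, which marks where an edge is but no longer signals whether the edge runs from dark to light or from light to dark.

```python
def simple_cell_responses(luminance):
    """Half-wave rectified edge responses for the two contrast polarities."""
    # Signed local contrast between neighboring positions (illustrative detector).
    diffs = [b - a for a, b in zip(luminance, luminance[1:])]
    dark_to_light = [max(d, 0.0) for d in diffs]   # responds to dark -> light edges
    light_to_dark = [max(-d, 0.0) for d in diffs]  # responds to light -> dark edges
    return dark_to_light, light_to_dark


def complex_cell_response(luminance):
    """Pool both polarities: the output keeps edge position but loses edge sign."""
    dark_to_light, light_to_dark = simple_cell_responses(luminance)
    return [p + q for p, q in zip(dark_to_light, light_to_dark)]


# The same gray region (0.5) bordered by white (1.0) or by black (0.0).
gray_then_white = [0.5, 0.5, 1.0, 1.0]
gray_then_black = [0.5, 0.5, 0.0, 0.0]

boundary_a = complex_cell_response(gray_then_white)
boundary_b = complex_cell_response(gray_then_black)
```

In this sketch the two pooled outputs are identical although the underlying contrasts are opposite, which is the sense in which a polarity-pooled boundary signal cannot carry a visible difference between dark and light.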

If boundaries are invisible, then how do we see anything? FACADE theory predicts 
that visible properties of a scene are represented by the surface processing stream, which 
is predicted to occur within the blob cortical stream (Figure 1). A key step in represent- 
ing a visible surface is “filling-in”. Why does this step occur? An early stage of surface 
processing compensates for variable illumination, or “discounts the illuminant” (Gross- 
berg and Todorovic, 1988; Helmholtz, 1910/1925; Land 1977), in order to prevent illu- 
minant variations, which can change from moment to moment, from distorting all per- 
cepts. Discounting the illuminant attenuates color and brightness signals, except near re- 
gions of sufficiently rapid surface change, such as edges or texture gradients, which are 
relatively uncontaminated by illuminant variations. Later stages of surface formation fill 
in the attenuated regions with these relatively uncontaminated color and brightness sig- 
nals, and do so at the correct relative depths from the observer through a process called 
surface capture. This multi-stage process is an example of hierarchical resolution of un- 
certainty, because the later filling-in stage overcomes uncertainties about brightness and 
color that were caused by discounting the illuminant at an earlier processing stage. 
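
A small sketch may help to make the discounting step concrete. It assumes, purely for illustration, a one-dimensional center-surround ratio computation; the function name `discount_illuminant`, the nearest-neighbor surround, and the `decay` constant are hypothetical choices rather than the published model equations. The same reflectance pattern is presented under a dim and a five-times brighter illuminant.

```python
def discount_illuminant(luminance, decay=0.01):
    """Center-surround ratio contrast; nonzero mainly where luminance changes."""
    n = len(luminance)
    responses = []
    for i in range(n):
        center = luminance[i]
        left = luminance[max(i - 1, 0)]       # replicate values at the borders
        right = luminance[min(i + 1, n - 1)]
        surround = 0.5 * (left + right)
        # Dividing by the local total makes the response depend on ratios,
        # so a common illuminant factor largely cancels out.
        responses.append((center - surround) / (decay + center + surround))
    return responses


reflectance = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
dim_scene = [r * 1.0 for r in reflectance]      # illuminant factor 1
bright_scene = [r * 5.0 for r in reflectance]   # illuminant factor 5

dim_response = discount_illuminant(dim_scene)
bright_response = discount_illuminant(bright_scene)
```

Under these assumptions the responses inside each uniform region are zero, the responses at the edge are nearly the same for both illuminants, and the interior brightness is left undetermined. That residual uncertainty is what a later filling-in stage has to resolve.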

How do the illuminant-discounted signals fill in an entire region? Filling-in behaves
like a diffusion of brightness across space (Arrington, 1994; Grossberg and Todorovic,
1988; Paradiso and Nakayama, 1991). Figure 2e leads to a percept of “neon color 
spreading” in which filling-in spreads outwardly from the individual gray inducers in all
directions. Its spread is thus unoriented. How is this spread of activation contained? FA- 
CADE theory predicts that signals from the boundary stream to the surface stream de- 
fine the regions within which filling-in is restricted. In response to Figure 2e, the brain 
forms boundaries that surround the annuli, except for small breaks in the boundaries where
the gray and black contours intersect, and also forms the square illusory boundary. Some 
of the gray color can escape from their annuli through these breaks into the square re- 
gion in the surface stream. This prediction has not yet been tested neurophysiologically. 
Without these boundary signals, filling-in would dissipate across space, and no surface 
percept could form. Invisible boundaries therefore indirectly assure their own visibility 
through their interactions with the surface stream. 
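
The containment of filling-in by boundary signals can be sketched as a boundary-gated diffusion. The following is a deliberately simplified, one-dimensional illustration, assuming a conservative nearest-neighbor diffusion in which a boundary closes the gate between two cells completely; the function name `fill_in`, the `rate` and `steps` parameters, and the sample signals are assumptions, and the published filling-in models are more elaborate.

```python
def fill_in(features, boundaries, steps=2000, rate=0.25):
    """Diffuse feature signals between neighbors unless a boundary blocks them.

    features[i]   : illuminant-discounted signal at cell i (initial activity)
    boundaries[i] : True if a boundary lies between cell i and cell i + 1
    """
    activity = [float(f) for f in features]
    for _ in range(steps):
        updated = activity[:]
        for i in range(len(activity) - 1):
            if boundaries[i]:
                continue  # boundary signal closes the gate: no flow across it
            flow = rate * (activity[i + 1] - activity[i])
            updated[i] += flow
            updated[i + 1] -= flow
        activity = updated
    return activity


# Contrast signals survive only next to the edge between cells 3 and 4.
features = [0.0, 0.0, 0.0, 0.8, 0.2, 0.0, 0.0, 0.0]
with_boundary = [False, False, False, True, False, False, False]
no_boundary = [False] * 7

contained = fill_in(features, with_boundary)
dissipated = fill_in(features, no_boundary)
```

With the boundary in place, each compartment fills in to its own uniform level, so a difference between the two sides remains. Without it, the signals spread evenly over the whole array and the difference is lost, in line with the remark that no surface percept could form without boundary signals.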

In Figure 2a, the square boundary is induced by four black pac man disks that are all 
less luminant than the white background. In the surface stream, discounting the illumi- 
nant causes these pac men to induce local brightness contrasts within the boundary of the 
square. At a subsequent processing stage, these brightness contrasts trigger surface fill- 
ing-in within the square boundary. The filled-in square is visible as a brightness differ- 
ence because the filled-in activity level within the square differs from the filled-in ac- 
tivity of the surrounding region. Filling-in can lead to visible percepts because it is sen- 
sitive to contrast polarity. These three properties (outward, unoriented, sensitive to con- 
trast-polarity) are complementary to the corresponding properties (inward, oriented, in- 
sensitive to contrast-polarity) of boundary completion. 

In Figure 2b, the opposite polarities of the two pairs of pac men with respect to the gray 
background lead to approximately equal filled-in activities inside and outside the square, so 
the boundary can be recognized but not seen. In Figure 2d, the white background can fill in
uniformly on both sides of the vertical boundary, so no visible contrast difference is seen. 

These remarks just begin the analysis of filling-in. Even in the seemingly simple case 
of the Kanizsa square, one often perceives a square hovering in front of four partially oc- 
cluded circular disks, which seem to be completed behind the square. FACADE theory 
predicts how surface filling-in is organized to help such figure-ground percepts to oc-
cur, in response to both 2-D pictures and 3-D scenes (Grossberg, 1994, 1997). 

In summary, boundary and surface formation illustrate two key principles of brain or- 
ganization: hierarchical resolution of uncertainty and complementary interstream inter- 
actions. Hierarchical resolution of uncertainty is illustrated by surface filling-in: dis- 
counting the illuminant creates uncertainty by suppressing surface color and brightness 
signals, except near surface discontinuities. Higher stages of filling-in complete the sur- 
face representation using properties that are complementary to those by which bound- 
aries are formed, guided by signals from these boundaries (Arrington, 1994; Grossberg, 
1994; Grossberg and Todorovic, 1988; Paradiso and Nakayama, 1991). 

Before going further, it should be emphasized that the proposal that boundaries and sur- 
faces are computed by the interblob and blob streams is not the same as the proposal of 
Hubel and Livingstone (1985) that orientation and color are computed by these streams.
Some differences between these proposals are: (1) Illusory contours are boundaries that
form over regions that receive no oriented bottom-up inputs. Boundaries that can form over 
regions that receive no oriented inputs cannot be viewed as part of an “orientation” system. 
(2) Boundaries are predicted to be invisible, or amodal, throughout the boundary stream, 
whether real or illusory. This is a concept that goes far beyond any classical notion of an 
“orientation” system. (3) Surfaces are predicted to fill-in amodally, or invisibly, in cortical 
area V2, but modally, or visibly, in cortical area V4. A surface system that can fill-in amodal 
percepts cannot be viewed as a “color” system. (4) Boundaries are predicted to organize the
separation of occluded objects from their occluding objects. The analysis of figure-ground 
separation also goes far beyond any direct concept of an “orientation” system. Because of 
such differences, FACADE theory has been able to propose explanations of many experi- 
ments that cannot be explained just using classical concepts of orientation and color. With- 
in FACADE theory, the circuits that form perceptual boundaries are called the Boundary 
Contour System, or BCS, and the circuits that form perceptual surfaces are called the Fea- 
ture Contour System, or FCS. This nomenclature arose from the realization that, in both sys- 
tems, an early stage of processing extracts contour-sensitive information as a basis for fur- 
ther processing. In the BCS, these contours are completed as invisible perceptual
boundaries. In the FCS, these contours are extracted by the process of “discounting the illuminant”
and form the basis for percepts of visible filled-in surface “features.” 

The percepts derived from Figure 2 clarify that seeing and thinking are different 
processes, but do not indicate where the thinking processes take place. Much evidence 
suggests that objects are recognized in the inferotemporal cortex, which may be primed 
by top-down inputs from prefrontal cortex; see Figure 2. Thus, if an amodal boundary in 
the interblob boundary stream of visual cortex is projected to a familiar recognition cat- 
egory in the inferotemporal cortex, it can be recognized even if it cannot be seen in area 
V4 of the blob surface stream. Recognition is not, however, merely a matter of feedfor- 
ward activation of the inferotemporal cortex. Nor is seeing just a matter of feedforward 
activation of area V4. Much experimental and theoretical evidence, notably as explained 
within Adaptive Resonance Theory, or ART, suggests that top-down matching and at- 
tentional processes are normally part of the events that lead to conscious recognition, 
even of an amodal percept (Grossberg, 1980, 1999c). 



BOUNDARY COMPLETION AND ATTENTION 
BY THE LAMINAR CIRCUITS OF VISUAL CORTEX 

How does visual cortex complete boundaries across gaps due to internal brain imperfec- 
tions, such as the retinal blind spot, or due to incomplete contours in external inputs, such 
as occluded surfaces, spatially discrete texture elements, illusory contour stimuli, or even 
missing pixels in impressionist paintings? This information is shown below to lead to new 
insights about processes like figure-ground perception that clarify many of the percepts 
that Kanizsa earlier observed in this field. In particular, the BCS model proposes how long- 
range horizontal cooperation interacts with shorter-range competition to carry out perceptual
grouping. The cooperating cells were predicted to satisfy a bipole property (Cohen and
Grossberg, 1984; Grossberg, 1984; Grossberg and Mingolla, 1985a, 1985b). Such “bipole
cells” realize inward and oriented boundary completion by firing when they receive inputs
from pairs of (almost) like-oriented and (almost) colinear scenic inducers. Von der Heydt
et al. (1984) reported the first neurophysiological evidence in support of this prediction, by
observing bipole cell properties in cortical area V2. Subsequent psychophysical studies
have also provided additional evidence in support of a bipole property during perceptual
grouping. Field et al. (1993) called this property an “association field,” and Shipley and
Kellman (1992) called it a “relatability condition”.

Figure 3. Boundary and attentional laminar circuits in interblob cortical areas V1 and V2. Inhibitory
interneurons are shown filled-in black. (a) The LGN provides bottom-up activation to layer 4 via two routes: It
makes strong connections directly into layers 4 and 6. Layer 6 then activates layer 4 via a 6 → 4 on-center
off-surround network. In all, LGN pathways to layer 4 form an on-center off-surround network, which
contrast-normalizes layer 4 cell responses. (b) Folded feedback carries attentional signals from higher cortex
into layer 4 of V1, via the 6 → 4 path. Corticocortical feedback axons tend to originate in layer 6 of the higher
area and to terminate in the lower cortex’s layer 1, where they excite apical dendrites of layer 5 pyramidal cells
whose axons send collaterals into layer 6. Several other routes through which feedback can pass into V1 layer 6
exist. Having arrived in layer 6, the feedback is then “folded” back up into the feedforward stream by passing
through the 6 → 4 on-center off-surround path. (c) Connecting the 6 → 4 on-center off-surround to the layer 2/3
grouping circuit: like-oriented layer 4 simple cells with opposite contrast polarities compete (not shown) before
generating half-wave rectified outputs that converge onto layer 2/3 complex cells in the column above them.
Groupings that form in layer 2/3 enhance their own positions in layer 4 via the 6 → 4 on-center, and suppress
other groupings via the 6 → 4 off-surround. There exist direct layer 2/3 → 6 connections in macaque V1, as well
as indirect routes via layer 5. (d) Top-down corticogeniculate feedback from V1 layer 6 to LGN also has an
on-center off-surround anatomy, similar to the 6 → 4 path. The on-center feedback selectively enhances LGN cells
that are consistent with the activation that they cause, and the off-surround contributes to length-sensitive
(endstopped) responses that facilitate grouping perpendicular to line ends. (e) The model V1/V2 circuit: V2
repeats the laminar pattern of V1 circuitry at a larger spatial scale; notably, the horizontal layer 2/3
connections have a longer range in V2. V1 layer 2/3 projects to V2 layers 6 and 4, just as LGN projects to
layers 6 and 4 of V1. Higher cortical areas send feedback into V2 which ultimately reaches layer 6, just as V2
feedback acts on layer 6 of V1. Feedback paths from higher cortical areas straight into V1 (not shown) can
complement and enhance feedback from V2 into V1. [Reprinted with permission from Grossberg and Raizada (2000).]

More recently, the BCS was extended to clarify how and why visual cortex, indeed all 
sensory and cognitive neocortex, is organized into layered circuits (Grossberg, 1999b; 
Grossberg and Raizada, 2000; Raizada and Grossberg, 2001). This LAMINART model 
predicts how bottom-up, top-down, and horizontal interactions within the cortical layers 
realize: (1) perceptual grouping; (2) attention; and (3) stable development and learning. 
The model proposes how mechanisms that achieve property (3) imply (1) and (2). That 
is, constraints on stable development of cortical circuits in the infant determine proper- 
ties of learning, perception, and attention in the adult. 

Figure 3 summarizes how known laminar cortical circuits may carry out perceptual group- 
ing and attention. This summary omits binocular interactions for simplicity; but see below. 
The lateral geniculate nucleus, or LGN, directly activates V1 layers 4 and 6 (Figure 3a).
Layer 6, in turn, sends on-center off-surround inputs to the simple cells of layer 4. Layer 6 
can strongly inhibit layer 4 through the off-surround, but the excitatory and inhibitory in- 
puts in the on-center are proposed to be approximately balanced, with perhaps a slight ex- 
citatory bias. Layer 6 can thus modulate the excitability of layer 4 cells, but not drive them 
to fire vigorously. This balance has been shown through modeling simulations to help the 
cortex develop its connections in a stable way (Grossberg and Williamson, 2001). The di- 
rect LGN-to-4 connections are proposed to drive layer 4 cells to reach suprathreshold acti- 
vation levels. The direct and indirect pathways from LGN-to-4 together form an on-center 
off-surround network. Under the assumption that layer 4 cells obey membrane, or shunting, 
equations, such an on-center off-surround network can contrast-normalize the responses of 
layer 4 cells (Douglas et al., 1995; Grossberg, 1973, 1980; Heeger, 1992), and thus preserve 
their sensitivity to input differences over a wide dynamic range. 
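
The contrast-normalizing effect of a shunting on-center off-surround network can be checked numerically. The sketch below assumes the simplest feedforward form of such an equation, dx_i/dt = -A x_i + (B - x_i) I_i - x_i * (sum of I_k over k != i), whose equilibrium is x_i = B I_i / (A + sum of all I_k). The chapter does not write the equation out, so this form, the parameter names `decay` (A) and `ceiling` (B), and the sample inputs are assumptions made for illustration.

```python
def shunting_steady_state(inputs, decay=1.0, ceiling=1.0):
    """Closed-form equilibrium x_i = B * I_i / (A + sum of all inputs)."""
    total = sum(inputs)
    return [ceiling * i / (decay + total) for i in inputs]


def simulate_shunting(inputs, decay=1.0, ceiling=1.0, dt=0.001, steps=20000):
    """Euler integration of the assumed shunting on-center off-surround equation."""
    x = [0.0] * len(inputs)
    total = sum(inputs)
    for _ in range(steps):
        x = [
            xi + dt * (-decay * xi            # passive decay
                       + (ceiling - xi) * ii  # on-center excitation, bounded by B
                       - xi * (total - ii))   # off-surround shunting inhibition
            for xi, ii in zip(x, inputs)
        ]
    return x


dim_inputs = [1.0, 2.0, 4.0]
bright_inputs = [10.0, 20.0, 40.0]  # same pattern, ten times the intensity

dim_activity = simulate_shunting(dim_inputs)
bright_activity = simulate_shunting(bright_inputs)
```

In this sketch the simulated activities converge to the closed-form equilibrium, the ratios between cells stay at 1 : 2 : 4 for both input intensities, and the total activity never exceeds the ceiling B. This is one way to read the claim that such a network preserves sensitivity to input differences over a wide dynamic range.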

Figure 3b illustrates how the modulatory layer 6-to-4 circuit can also be used by
top-down signals from V2 layer 6 to attentionally modulate the excitability of V1 layer 4
cells, while inhibiting layer 4 cells that are not in the attentional focus. 

Boundary completion that obeys a bipole property occurs within layer 2/3, as illustrat- 
ed in Figure 3c. Layer 4 cells activate layer 2/3 complex cells, which communicate with 
their layer 2/3 neighbors via long-range horizontal excitatory connections and shorter- 
range inhibitory interneurons. The strengths of these excitatory and inhibitory interac- 
tions are predicted to be approximately balanced. This balance has also been proposed 
to help ensure that the cortex develops its connections in a stable way (Grossberg and 
Williamson, 2001). Because of this balance, activation of a single layer 2/3 cell causes 
its horizontal excitatory and inhibitory connections to be approximately equally
activated. The inhibition cancels the excitation, so a boundary cannot shoot out from individ-
ual scenic inducers. When two or more (approximately) like-oriented and (approximate- 
ly) colinear layer 2/3 cells are activated, the excitation that converges on cells in between 
can summate, but the inhibitory interneurons form a recurrent inhibitory net that nor- 
malizes its total activity. As a result, total excitation exceeds total inhibition, and the cells 
can fire. A boundary can hereby be completed inwardly, but not outwardly. 
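
The logic of this paragraph can be caricatured in a few lines. The sketch below is a one-dimensional toy, not the LAMINART circuit: it assumes that the excitation arriving from each flank saturates at one unit and that the normalized inhibition is capped at one flank's worth of excitation. The names `bipole_net_input`, `complete_boundary`, and `reach` are hypothetical.

```python
def bipole_net_input(inducers, position, reach):
    """Net horizontal input to a cell from its left and right flanks."""
    left = inducers[max(0, position - reach):position]
    right = inducers[position + 1:position + 1 + reach]
    left_drive = min(sum(left), 1.0)    # long-range excitation from the left, saturating
    right_drive = min(sum(right), 1.0)  # long-range excitation from the right, saturating
    excitation = left_drive + right_drive
    # Assumed normalization: total inhibition never exceeds one flank's excitation,
    # so it cancels a single flank but not two flanks acting together.
    inhibition = min(excitation, 1.0)
    return excitation - inhibition


def complete_boundary(inducers, reach):
    """Mark cells that are driven bottom-up or grouped by both flanks."""
    completed = []
    for position, bottom_up in enumerate(inducers):
        grouped = bipole_net_input(inducers, position, reach) > 0.0
        completed.append(1 if (bottom_up > 0 or grouped) else 0)
    return completed


two_inducers = [0, 0, 1, 0, 0, 0, 1, 0, 0]
one_inducer = [0, 0, 0, 0, 1, 0, 0, 0, 0]

between_pair = complete_boundary(two_inducers, reach=5)
around_single = complete_boundary(one_inducer, reach=5)
```

With two inducers the gap between them is filled while the cells outside the pair stay silent; with a single inducer nothing is added. Under the stated assumptions this reproduces inward, but not outward, completion.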

If a scene has unambiguous groupings, then this horizontal interaction can rapidly com-
plete boundaries along a feedforward pathway from layer 4 to layer 2/3 and then hori- 
zontally across layer 2/3, from which outputs to higher cortical areas are emitted. This 
property is consistent with recent data showing that very fast recognition of visual scenes 
is possible (Thorpe et al., 1996). On the other hand, it is also well-known that some
scenes take longer to recognize. Within the model, in response to scenes with multiple 
possible groupings, competitive interactions within layers 2/3 and 4 can keep the layer 
2/3 activities small, and thus prevent large output signals from being rapidly emitted to 
higher cortical areas. These smaller layer 2/3 activities are large enough, however, to 
generate positive feedback signals between layers 2/3-6-4-2/3 of their own cortical area 
(see Figure 3c). The positive feedback signals can quickly amplify the activities of the 
strongest grouping, which can then generate large outputs from layer 2/3, while its strong 
layer 6-to-4 off-surround signals inhibit weaker groupings. These intracortical feedback 
signals convert the cells across the layers into functional columns (Mountcastle, 1957) 
and show that the classical Hubel and Wiesel (1977) proposal that there are feedforward
interactions from layer 4 simple cells to layer 2/3 complex cells is part of a more com- 
plex circuit which also ties these cells together using nonlinear feedback signals. 

The above discussion shows that the layer 6-to-4 circuit has at least three functions: It 
contrast-normalizes bottom-up inputs, selects groupings via intracortical feedback from 
layer 2/3 without causing a loss of analog sensitivity, and primes attention via intercor- 
tical feedback from higher cortical areas. This intimate connection between grouping 
and attention enables attention to flow along a grouping, and thereby selectively enhance 
an entire object, as Roelfsema et al. (1998) have shown in macaque area V1. Because at-
tention acts through an on-center off-surround circuit, it can “protect” feature detectors 
from inhibition by distractors by using its off-surround, as Reynolds et al. (1999) have 
shown in areas V2 and V4. Because both cooperation and competition influence group- 
ings, the effects of colinear inducers can be either facilitatory or suppressive at different 
contrasts, as Polat et al. (1998) have shown in area V1. Grossberg and Raizada (2000)
and Raizada and Grossberg (2001) have quantitatively simulated these and related neu- 
rophysiological data using the LAMINART model. 

The model proposes that a top-down on-center off-surround network from V1 layer 6
to the LGN (Figure 3d) can act in much the same way as the top-down signals from V2
layer 6 to V1. The existence of this type of top-down modulatory corticogeniculate feed-
back was originally predicted in Grossberg (1976), and has recently been supported by 
neurophysiological data of Sillito et al. (1994). Grossberg (1976) also predicted that such 
top-down modulatory feedback helps to stabilize the development of both bottom-up and 
top-down connections between the LGN and V1. This prediction has not yet been
neurophysiologically tested, although it is consistent with evidence of Murphy et al. (1999)
showing that the top-down signals from an oriented cortical cell tend to distribute them- 
selves across the LGN in an oriented manner that is consistent with such a learning 
process. Figure 3e synthesizes all of these LGN and cortical circuits into a system
architecture, which shows that the horizontal interactions within V2 layer 2/3 can have a
broader spatial extent than those in V1 layer 2/3. The V2 interactions are proposed to
carry out perceptual groupings like illusory contours, texture grouping, completion of
occluded objects, and bridging the blind spot. The V1 interactions are proposed to
improve signal-to-noise of feature detectors within the V1 cortical map.

The LAMINART model has been extended to clarify how depthful boundaries are 
formed (Grossberg and Howe, 2002; Howe and Grossberg, 2001) and how slanted sur- 
faces are perceived in 3-D (Swaminathan and Grossberg, 2001). This generalization is 
consistent with laminar anatomical and neurophysiological data. For example, suppose 
that a scenic feature activates monocular simple cells in layer 4 via the left eye and right 
eye. These simple cells are sensitive to the same contrast polarity and orientation. Be- 
cause they are activated by different eyes, they are positionally displaced with respect to 
one another on their respective retinas. These monocular simple cells activate disparity- 
selective binocular simple cells in layer 3B. Binocular simple cells that are sensitive to 
the same disparity but to opposite contrast polarities then activate complex cells in lay- 
er 2/3A. This two-stage process enables the cortex to binocularly match features with the 
same contrast polarity, yet to also form boundaries around objects in front of textured 
backgrounds, as in Figure 2c. 
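
The two-stage matching just described can be illustrated with a toy one-dimensional sketch. It is not the LAMINART implementation: the function names, the signed-contrast encoding, and the use of a minimum as the binocular match are assumptions made only for illustration.

```python
def binocular_simple(left, right, disparity):
    """Layer 3B sketch: match same-polarity monocular responses at one disparity.

    left, right: signed contrast responses per position
    (+ for one contrast polarity, - for the opposite polarity).
    Returns (on, off): match strength per position for each polarity.
    """
    n = len(left)
    on, off = [0.0] * n, [0.0] * n
    for x in range(n):
        xr = x + disparity  # position of the candidate match in the right eye
        if 0 <= xr < n:
            l, r = left[x], right[xr]
            on[x] = min(max(l, 0.0), max(r, 0.0))     # both eyes: + polarity
            off[x] = min(max(-l, 0.0), max(-r, 0.0))  # both eyes: - polarity
    return on, off


def complex_cells(on, off):
    """Layer 2/3A sketch: pool opposite polarities tuned to the same disparity."""
    return [a + b for a, b in zip(on, off)]


# Two edges of opposite polarity, shifted by one position between the eyes.
left = [0.0, 1.0, 0.0, 0.0, -1.0, 0.0, 0.0]
right = [0.0, 0.0, 1.0, 0.0, 0.0, -1.0, 0.0]

on, off = binocular_simple(left, right, disparity=1)
boundary = complex_cells(on, off)          # both edges are fused
wrong = complex_cells(*binocular_simple(left, right, disparity=0))  # no fusion
```

In this sketch, features of opposite polarity never fuse at the first stage, while the second stage still produces a single boundary signal regardless of polarity.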



3-D VISION AND FIGURE-GROUND SEPARATION 

How are depthful boundary and surface representations formed? How are percepts of 
occluding and occluded objects represented in depth? FACADE theory proposes how 
such percepts arise when boundary and surface representations interact together during 
3-D vision and figure-ground perception. The rest of this section summarizes some of
the key design problems that need to be solved before outlining model mechanisms that
embody a proposed solution of these problems. A key insight is that the bipole property that
controls perceptual grouping also initiates figure-ground separation. How FACADE the- 
ory explains figure-ground separation will then be illustrated with a simulation example 
of Bregman-Kanizsa figure-ground separation. 

1. 3-D Surface Capture and Filling-in. How are the luminance and color signals that are 
received by the two eyes transformed into 3-D surface percepts? FACADE theory posits 
that multiple depth-selective boundary representations exist and interact with multiple sur- 
face filling-in domains to determine which surfaces in depth can be seen. The same filling- 
in processes which enable us to see perceptual qualities like brightness and color are here- 
by predicted to also determine the relative depth of these surfaces. In particular, depth-se- 
lective boundaries selectively capture brightness and color signals at the subset of filling-in 
domains with which they interact. Filling-in of these captured signals leads to surface
percepts at the corresponding relative depths from the observer. The hypothesis that the same
filling-in process controls brightness, color, and depth predicts that perceived depth and 
brightness can influence one another. In fact, the property of “proximity-luminance covari- 
ation” means that brighter surfaces can look closer (Egusa, 1983). In particular, brighter 
Kanizsa squares can look closer than their pac-man inducers (Bradley and Dumais, 1984).

2. Binocular Fusion, Grouping, and da Vinci Stereopsis. Granted that surface capture 
can achieve depth-selective filling-in, how are the depth-selective boundaries formed 
that control surface capture? Our two eyes view the world through slightly different per- 
spectives. Their different views lead to relative displacements, or disparities, on their 
retinas of the images that they register. These disparate retinal images are binocularly 
matched at disparity-sensitive cells, as noted above in the discussion of binocular
matching within cortical layer 3B (Grossberg and Howe, 2002). The disparity-sensitive cells
in the interblobs of area V1 are used to form depth-selective boundaries in the interstripes
of area V2. These boundaries capture surface filling-in signals at the corresponding fill- 
ing-in domains in the thin stripes of area V2, among other places. 

When two eyes view the world, part of a scene may be seen by only one eye. No dis- 
parity signals are available here to determine the depth of the monocularly viewed fea- 
tures, yet they are seen at the correct depth, as during Da Vinci stereopsis (Nakayama 
and Shimojo, 1990). FACADE theory proposes how depth-selective filling-in of a near- 
by binocularly viewed region spreads into the monocularly viewed region to impart the 
correct depth. This proposal also explains related phenomena like the “equidistance ten- 
dency,” whereby a monocularly viewed object in a binocular scene seems to lie at the 
same depth as the retinally most contiguous binocularly viewed object (Gogel, 1965). 
How this works leads to a number of new hypotheses, including how horizontal and 
monocularly-viewed boundaries are added to all boundary representations, and how an 
“asymmetry between near and far” adds boundaries from nearer to farther surface rep- 
resentations (Grossberg, 1994; Grossberg and Howe, 2002; Grossberg and McLoughlin, 
1997). Without these mechanisms, all occluding objects would look transparent. 

3. Multiple Scales into Multiple Boundary Depths. When a single eye views the image
of an object in depth, the same size of the retinal image may be due to either a large ob- 
ject far away or to a small object nearby. How is this ambiguity overcome to activate the 
correct disparity-sensitive cells? The brain uses multiple receptive field sizes, or scales, 
that achieve a “size-disparity correlation” between retinal size and binocular disparity. It 
has often been thought that larger scales code nearer objects and smaller scales more dis- 
tant objects. For example, a nearer object can lead to a larger disparity that can be binoc- 
ularly fused by a larger scale. In fact, each scale can fuse multiple disparities, although 
larger scales can fuse a wider range of disparities (Julesz and Schumer, 1981). This am- 
biguity helps to explain how higher spatial frequencies in an image can sometimes look 
closer, rather than more distant, than lower spatial frequencies in an image, and how this 
percept can reverse during prolonged viewing (Brown and Weisstein, 1988). FACADE 
theory explains these reversals by analyzing how multiple spatial scales interact to form 
depth-selective boundary groupings (Grossberg, 1994). 

Multiple spatial scales also help to explain how shaded surfaces are seen. In fact, if
boundaries were sensitive only to the bounding edge of a shaded surface, then shaded surfaces
would look uniformly bright and flat after filling-in occurs. This does not occur because 
boundaries respond to shading gradients as well as to edges. Within each scale, a “bound- 
ary web” of small boundary compartments can be elicited in response to a shading gradient. 
Although the boundaries themselves are invisible, their existence can indirectly be detected 
because the boundaries in a boundary web trap contrasts locally. The different contrasts in
each of the small compartments lead to a shaded surface percept. Different scales may
react differently to such a shading gradient, thereby leading to a different boundary web of
small boundary compartments at each depth. Each boundary web can capture and selec- 
tively fill-in its contrasts on a distinct surface representation in depth. The ensemble of these 
filled-in surface representations can give rise to a percept of a shaded surface in depth. 

4. Recognizing Objects vs. Seeing their Unoccluded Parts. In many scenes, some ob- 
jects lie partially in front of other objects and thereby occlude them. How do we know 
which features belong to the different objects, both in 3-D scenes and 2-D pictures? If 
we could not make this distinction, then object recognition would be severely impaired. 
FACADE theory predicts how the mechanisms which solve this problem when we view 
3-D scenes also solve the problem when we view 2-D pictures (Grossberg, 1994, 1997). 






Figure 4. (a) Uppercase gray B letters that are partially occluded by a black snakelike occluder, (b) Same B 
shapes as in (a) except the occluder is white and therefore merges with the remainder of the white background. 
[Adapted with permission from Nakayama, Shimojo, and Silverman (1989).] 



In the Bregman-Kanizsa image of Figure 4a, the gray B shapes can be readily recog- 
nized even though they are partially occluded by the black snakelike occluder. In Figure 
4b, the occluder is removed. Although the same amount of gray is shown in both images, 
the B shapes are harder to recognize in Figure 4b. This happens because the boundaries
that are shared by the black occluder and the gray B shapes in Figure 4a are assigned by 
the brain to the black occluder. The bipole property plays an important role in initiating 
this process. The occluder boundaries form within a boundary representation that codes 
a nearer distance to the viewer than the boundary representation of the gray shapes. With 
the shared boundaries removed from the gray B shapes, the B boundaries can be com- 
pleted behind the positions of the black occluder as part of a farther boundary represen- 
tation. The completion of these boundaries also uses the bipole property. These com- 
pleted boundaries help to recognize the B’s at the farther depth. In Figure 4b, the shared 
boundaries are not removed from the gray shapes, and they prevent the completion of 
the gray boundaries. 

To actually do this, the brain needs to solve several problems. First, it needs to figure 
out how geometrical and contrast factors work together. In Figure 4a, for example, the 
T-junctions where the gray shapes intersect the black occluders are a cue for signaling 
that the black occluder looks closer than the gray shapes. However, if you imagine the 
black occluder gradually getting lighter until it matches the white background in Figure 
4b, it is clear that, when the occluder is light enough, the gray shapes will no longer ap- 
pear behind the occluder. Thus, geometrical factors like T-junctions are not sufficient to 
cause figure-ground separation. They interact with contrast relationships within the 
scene too. 

The brain also needs to figure out how to complete the B boundaries “behind” the oc- 
cluder in response to a 2-D picture. In particular, how do different spatial scales get dif- 
ferentially activated by a 2-D picture as well as a 3-D scene, so that the occluding and 
occluded objects can be seen in depth? Moreover, if the B boundaries can be completed 
and thereby recognized, then why do we not see completely filled-in B shapes too, in- 
cluding in the regions behind the black occluder? This state of affairs clarifies that there 
is a design tension between properties needed to recognize opaque objects, including 
where they are occluded, and our ability to see only their unoccluded surfaces. Here 
again, “the asymmetry between near and far” plays a key role, as noted below. 

5. From Boundary-Surface Complementarity to Consistency. Such subtle data make 
one wonder about how the brain evolved to behave in this way. FACADE theory predicts 
how simple mechanisms that realize a few new perceptual principles can explain figure- 
ground data when they interact together. One such principle is that boundary and surface 
computations are complementary, as noted above. How, then, do we see a single percept 
wherein boundaries and surfaces are consistently joined? How does complementarity be- 
come consistency? FACADE theory proposes how consistency is realized by a simple 
kind of feedback that occurs between the boundary and surface streams. Remarkably, 
this feedback also explains many properties of figure-ground perception. Figure-ground 
explanations can hereby be reduced to questions about complementarity and consisten- 
cy, rather than about issues concerning the ecological validity, or probability, of these 
percepts in our experience. The remainder of the chapter sketches some of these
principles and mechanisms, followed by explanations and simulations of the Bregman-Kanizsa
and Kanizsa Stratification percepts.


3-D BOUNDARY AND SURFACE FORMATION

Figure 5 is a macrocircuit of FACADE theory in its present form. This macrocircuit will 
be reviewed to clarify how it embodies solutions to the five design problems that have just 
been summarized. Monocular processing of left-eye and right-eye inputs by the retina and 
LGN discounts the illuminant and generates parallel signals to the BCS and FCS. These 
signals activate model cortical simple cells via pathways 1 in Figure 5, and monocular 
filling-in domains (FIDOs) via pathways 2. Model simple cells have oriented receptive 
fields and come in multiple sizes. Simple cell outputs are binocularly combined at dis- 
parity-sensitive complex and complex end-stopped (or hypercomplex) cells via pathways 
3. Complex cells with larger receptive fields can binocularly fuse a broader range of dis- 
parities than can cells with smaller receptive fields, thereby realizing a “size-disparity cor- 
relation.” Competition across disparity at each position and among cells of a given size 
scale sharpens complex cell disparity tuning (Fahle and Westheimer, 1995). Spatial com- 
petition (end-stopping) and orientational competition convert complex cell responses in- 
to spatially and orientationally sharper responses at hypercomplex cells. 
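
The size-disparity correlation and the sharpening effect of competition across disparity can be sketched with a small numerical example. The triangular tuning curve, the assumption that the fusable range is proportional to receptive field size, and the subtractive form of the competition are all illustrative choices, not the equations of the model.

```python
def fusion_response(scale, disparity, preferred):
    """Triangular disparity tuning whose width grows with receptive field size.

    Assumption for illustration: the fusable range equals the scale.
    """
    width = float(scale)
    return max(0.0, 1.0 - abs(disparity - preferred) / width)


def sharpen(responses):
    """Competition across disparity: each cell is inhibited by the mean of its rivals."""
    n = len(responses)
    total = sum(responses)
    return [max(0.0, r - (total - r) / (n - 1)) for r in responses]


preferred = [-2, -1, 0, 1, 2]   # preferred disparities of five model cells
stimulus = 0.5                  # disparity of the viewed feature

small = [fusion_response(1, stimulus, p) for p in preferred]
large = [fusion_response(4, stimulus, p) for p in preferred]
large_sharp = sharpen(large)


def count_active(responses):
    return sum(1 for r in responses if r > 0.0)
```

With these toy numbers, the small scale activates only the two cells nearest the stimulus disparity, the large scale activates all five, and competition reduces the large-scale response to the same two cells.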




Figure 5. FACADE macrocircuit showing interactions of the Boundary Contour System (BCS) and Feature 
Contour System (FCS). See text for details. [Reprinted with permission from Kelly and Grossberg (2000).] 
















How are responses from multiple receptive field sizes combined to generate boundary
representations of relative depths from the observer? Hypercomplex cells in area V1 activate
bipole cells in area V2 via pathway 4. The bipole cells carry out long-range grouping and
boundary completion via horizontal connections that occur in layer 2/3 of area V2 inter- 
stripes. Bipole grouping collects together outputs from hypercomplex cells of all sizes that 
are sensitive to a given depth range. The bipole cells then send excitatory feedback signals 
via pathways 5 back to all hypercomplex cells that represent the same position and orienta- 
tion, and inhibitory feedback signals to hypercomplex cells at nearby positions and orienta- 
tions; cf., layer 2/3-6-4 inhibition in Figure 3e. The feedback groups cells of multiple sizes 
into a BCS representation, or copy, that is sensitive to a range of depths. Multiple BCS 
copies are formed, each corresponding to different (but possibly overlapping) depth ranges. 



[Figure 6 panel labels: T-junction sensitivity; image; long-range cooperation (+), bipole cells; short-range competition (-), hypercomplex cells; boundary.]



Figure 6. T-junction sensitivity in the BCS: (a) T-junction in an image. (b) Bipole cells provide long-range
cooperation (+), whereas hypercomplex cells provide shorter-range competition (-). (c) An end-gap in the
vertical boundary arises due to this combination of cooperation and competition. [Reprinted with permission from
Grossberg (1997).] 



Bipole cells play a key role in figure-ground separation. Each bipole cell has an
oriented receptive field with two branches (Figure 6). Long-range excitatory bipole
signals in layer 2/3 combine with shorter-range inhibitory signals in layers 4 and 2/3 to
make the system sensitive to T-junctions (Figure 6). In particular, horizontally-orient- 
ed bipole cells that are located where the top of the T joins its stem receive excitatory 
inputs to both of their receptive field branches. Vertically-oriented bipole cells that 
process the stem of the T where it joins the top receive excitatory support only in the 
one branch that is activated by the stem. Because of this excitatory imbalance, inhibi- 
tion of the stem by the top can cause a gap in the stem boundary, termed an end-gap 
(Figure 6). During filling-in, boundaries contain the filling-in process. Where end- 
gaps occur, brightness or color can flow out of a figural region, much as it flows out 
of the annuli in Figure 2e during neon color spreading. FACADE theory predicts that
this escape of color or brightness via filling-in is a key step that initiates figure-ground
separation (Grossberg, 1994, 1997; Grossberg and McLoughlin, 1997; Kelly and
Grossberg, 2000). Figure 7 shows a simulation from Kelly and Grossberg (2000)
which illustrates end-gaps in response to the Bregman-Kanizsa image. End-gaps occur
where the horizontal occluder touches the partially occluded B shape, at both near 
(Figure 7a) and far (Figure 7b) depths. 
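
The excitatory imbalance at a T-junction can be made concrete with a toy calculation at a single position. This is a deliberately minimal sketch: the AND-like bipole gate, the additive combination with bottom-up input, and the subtractive competition between orientations are simplifying assumptions, and all function and parameter names are hypothetical.

```python
def bipole(branch_a, branch_b):
    """Cooperative bipole signal: requires input on both branches (AND-like gate)."""
    return min(branch_a, branch_b)


def junction(h_left, h_right, v_up, v_down,
             h_input=1.0, v_input=1.0, inhibition=1.0):
    """Return (horizontal, vertical) boundary strength at one position.

    h_left/h_right and v_up/v_down are the inputs reaching the two branches
    of a horizontally and a vertically oriented bipole cell at this position.
    """
    h_total = h_input + bipole(h_left, h_right)  # bottom-up input plus cooperation
    v_total = v_input + bipole(v_up, v_down)
    # Shorter-range competition between orientations at the same position.
    h_out = max(0.0, h_total - inhibition * v_total)
    v_out = max(0.0, v_total - inhibition * h_total)
    return h_out, v_out


# T-junction: the top continues left and right; the stem exists only below.
h_t, v_t = junction(h_left=1.0, h_right=1.0, v_up=0.0, v_down=1.0)

# Control: an uninterrupted vertical line with no horizontal edge present.
h_line, v_line = junction(h_left=0.0, h_right=0.0, v_up=1.0, v_down=1.0,
                          h_input=0.0)
```

In the T-junction case the horizontal boundary survives and the vertical boundary is driven to zero, which corresponds to the end-gap; in the control case the vertical boundary is left intact.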








Figure 7. Binocular boundaries for monocular filling-in in response to a Bregman-Kanizsa image: (a) near 
depth and (b) far depth. Filled-in Monocular FIDOs before boundary pruning occurs: (c) near depth and (d) 
far depth. [Reprinted with permission from Kelly and Grossberg (2000).] 



How do multiple depth-selective BCS copies capture brightness and color signals within 
depth-selective FCS surface representations? This happens in at least two stages. The first 
stage of monocular filling-in domains, or FIDOs, may exist in V2 thin stripes. Each monoc- 
ular FIDO is broken into three pairs of opponent filling-in domains (black/white, red/green, 
blue/yellow) that receive achromatic and chromatic signals from a single eye. A pair of 
monocular FIDOs, one for each eye, corresponds to each depth-selective BCS copy, and
receives its strongest boundary-gating signals from this BCS copy. Each monocular FIDO may
also receive weaker boundary signals from BCS copies that represent depths near to that of
its primary BCS copy. In this way, a finite set of FIDOs can represent a continuous change
in perceived depth, much as three classes of retinal cones can be used to represent a
continuum of perceived colors.
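
The idea that a few broadly tuned FIDOs can represent a continuum of depths can be sketched as a population code. The triangular tuning and the activity-weighted readout below are assumptions introduced for illustration; the chapter does not specify a readout rule.

```python
def fido_activities(depth, centers, width):
    """Activity of each depth-tuned FIDO: strongest at its primary depth,
    weaker for nearby depths, zero beyond the tuning width (assumed shape)."""
    return [max(0.0, 1.0 - abs(depth - c) / width) for c in centers]


def decode_depth(activities, centers):
    """Hypothetical readout: activity-weighted average of preferred depths."""
    total = sum(activities)
    return sum(a * c for a, c in zip(activities, centers)) / total


centers = [0.0, 1.0, 2.0]                 # three FIDOs: near, middle, far
acts = fido_activities(0.25, centers, width=1.0)
estimate = decode_depth(acts, centers)
```

Here a depth that lies between the preferred depths of the first two FIDOs activates both in graded amounts, and the relative activities suffice to recover the intermediate value.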

Surface capture is triggered when boundary-gating BCS signals interact with
illuminant-discounted FCS signals. Pathways 2 in Figure 5 input discounted monocular FCS
signals to all monocular FIDOs. Only some FIDOs will selectively fill-in these signals, 
and thereby lift monocular FIDO signals into depth-selective surface representations for 
filling-in. The boundary signals along pathways 6 in Figure 5 determine which FIDOs 
will fill-in. These boundary signals selectively capture FCS inputs that are spatially co- 
incident and orientationally aligned with them. Other FCS inputs are suppressed. These 
properties arise when double-opponent and filling-in processes within the FIDOs inter- 
act with oriented boundary-gating signals from the boundary representations. How this 
happens, and how it can explain data about binocular fusion and rivalry, among other 
percepts, are discussed in Grossberg (1987). 

Because these filled-in surfaces are activated by depth-selective BCS boundaries, they 
inherit the depths of their boundaries. 3-D surfaces may hereby represent depth as well 
as brightness and color. This link between depth, brightness, and color helps to explain 
“proximity-luminance covariation,” or why brighter surfaces tend to look closer; e.g.,
Egusa (1983). 

Not every filling-in event can generate a visible surface. Because activity spreads un- 
til it hits a boundary, only surfaces that are surrounded by a connected BCS boundary 
are effectively filled-in. Otherwise, the spreading activity can dissipate across the FIDO. 
This property helps to explain data ranging from neon color spreading to how T-junc- 
tions influence 3-D figure-ground perception (Grossberg, 1994). Figures 7c and 7d il- 
lustrate how filling-in occurs in response to the Bregman-Kanizsa boundaries of Figures 
7a and 7b. The connected boundary surrounding the occluder can contain its filled-in ac- 
tivity, but activity spreads through the end-gaps of the B boundaries, thereby dissipating 
across space, at both near (Figure 7c) and far (Figure 7d) depths. 
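
The difference between contained and dissipating filling-in can be illustrated with a one-dimensional diffusion sketch in which boundaries act as barriers to flow. The discrete update rule, the rate constant, and the array layout are assumptions made for illustration and are not the model's filling-in equations.

```python
def fill_in(inputs, walls, steps=2000, rate=0.25):
    """Diffuse activity along a 1-D FIDO.

    inputs: initial feature signal per cell.
    walls[i]: True if a boundary blocks flow between cell i and cell i + 1.
    """
    act = list(inputs)
    n = len(act)
    for _ in range(steps):
        new = act[:]
        for i in range(n - 1):
            if not walls[i]:
                flow = rate * (act[i] - act[i + 1])
                new[i] -= flow
                new[i + 1] += flow
        act = new
    return act


n = 12
inputs = [0.0] * n
inputs[4] = 1.0                      # one feature signal inside the region

closed_walls = [False] * (n - 1)
closed_walls[2] = True               # boundary on the left of cells 3..5
closed_walls[5] = True               # boundary on the right of cells 3..5

gap_walls = closed_walls[:]
gap_walls[5] = False                 # an end-gap opens the right-hand boundary

contained = fill_in(inputs, closed_walls)
dissipated = fill_in(inputs, gap_walls)
```

With a connected boundary the activity settles at one third per cell inside the region; with the end-gap it spreads over nine cells and falls to one ninth per cell, which a contrast-sensitive readout would register as a much weaker surface.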

An analysis of how the BCS and FCS react to 3-D images shows that too many bound- 
ary and surface fragments are formed as a result of the size-disparity correlation. This re- 
dundancy is clear in Figure 7. As noted above, larger scales can fuse a larger range of 
disparities than can smaller scales. How are the surface depths that we perceive selected 
from this range of possibilities across all scales? The FACADE theory answer to this 
question follows from its answer to the more fundamental question: How is perceptual 
consistency derived from boundary-surface complementarity? FACADE theory predicts 
how this may be achieved by feedback between the boundary and surface streams, which
is predicted to occur no later than the interstripes and thin stripes of area V2. This
mutual feedback also helps to explain why blob and interblob cells share so many receptive
field properties even though they carry out such different tasks. In particular, boundary 
cells, which summate inputs from both contrast polarities, can also be modulated by sur- 
face cells, which are sensitive to just one contrast polarity. 

Boundary-surface consistency is realized by a contrast-sensitive process that detects the 
contours of successfully filled-in regions within the monocular FIDOs. Only successfully 
filled-in regions can activate such a contour-sensitive process, because other regions either 
do not fill-in at all, or their filling-in dissipates across space. These filled-in contours
activate FCS-to-BCS feedback signals (pathways 7 in Figure 5) that strengthen boundaries at
their own positions and depths, while inhibiting redundant boundaries at farther depths. 
Thus the feedback pathway forms an on-center off-surround network whose inhibition is bi- 
ased towards farther depths. This inhibition from near-to-far is called “boundary pruning.” 
It illustrates a perceptual principle called the “asymmetry between near and far.” This prin- 
ciple shows itself in many data, including 3-D neon color spreading (Nakayama et al., 
1990). Grossberg (1994, 1999a) discusses how to explain such data. 

How does boundary pruning influence figure-ground separation? Boundary pruning 
spares the closest surface representation that successfully fills-in a region, and inhibits 
redundant copies of occluding object boundaries that would otherwise form at farther 
depths. When these redundant occluding boundaries are removed, the boundaries of par- 
tially occluded objects can be completed behind them within BCS copies that represent 
farther depths, as we perceive when viewing Figure 4a but not 4b. Moreover, when the 
redundant occluding boundaries collapse, the redundant surfaces that they momentarily 
supported at the monocular FIDOs also collapse. Occluding surfaces hereby form in 
front of occluded surfaces. 
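
Boundary pruning can be sketched as an on-center off-surround operation across depth whose inhibition acts only from near to far. The data layout (one list of boundary strengths per depth, nearest first), the additive on-center term, and the unit inhibition gain are illustrative assumptions.

```python
def prune_boundaries(boundaries, contours, inhibition=1.0):
    """Apply near-to-far boundary pruning.

    boundaries[d][x]: boundary strength at depth d (0 = nearest), position x.
    contours[d][x]:   contour of a successfully filled-in surface at depth d.
    Each contour strengthens the boundary at its own depth and position and
    inhibits boundaries at the same position in all farther depths.
    """
    pruned = []
    for d, copy in enumerate(boundaries):
        row = []
        for x, b in enumerate(copy):
            nearer = sum(contours[k][x] for k in range(d))
            row.append(max(0.0, b + contours[d][x] - inhibition * nearer))
        pruned.append(row)
    return pruned


# Positions 1 and 3: occluder edges. Position 5: an edge of the occluded shape.
boundaries = [
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],   # near copy
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],   # far copy (redundant occluder edges)
]
contours = [
    [0.0, 1.0, 0.0, 1.0, 0.0, 0.0],   # occluder fills in successfully when near
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],   # nothing has filled in yet at the far depth
]

near, far = prune_boundaries(boundaries, contours)
```

After pruning, the occluder edges are strengthened in the near copy and removed from the far copy, leaving the far copy free to complete the boundary of the occluded shape.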




Figure 8. Amodal boundary and surface representations in response to a Bregman-Kanizsa image. Binocular 
boundaries after boundary pruning occurs: (a) near depth and (b) far depth. Filled-in amodal surface represen- 
tations at the Monocular FIDOs: (c) near depth and (d) far depth. [Reprinted with permission from Kelly and 
Grossberg (2000).] 

Figures 8a and 8b illustrate boundary pruning and its asymmetric action from
near-to-far. The near boundaries in Figure 7a are retained in Figure 8a. But the far boundary of
the occluder in Figure 7b is inhibited by boundary pruning signals from the contour of 
the near filled-in surface representation in Figure 7c. When these occluder boundaries 
are eliminated, the B boundary can be colinearly completed, as in Figure 8b. Because the 
boundaries of both the horizontal occluder and the B are now connected, they can con- 
tain their filled-in activities within the Monocular FIDOs, as shown in Figures 8c and 8d. 

Boundary pruning also helps to explain data about depth/brightness interactions, such 
as: Why do brighter Kanizsa squares look closer (Bradley and Dumais, 1984)? Why is 
boundary pruning relevant here? A Kanizsa square’s brightness is an emergent property 
that is determined after all brightness and darkness inducers fill-in within the square. 
This emergent brightness within the FIDOs then influences the square’s perceived depth. 
Within FACADE, this means that the FIDO’s brightness influences the BCS copies that 
control relative depth. This occurs via the FCS-to-BCS feedback signals, including
pruning, that ensure boundary-surface consistency (Grossberg, 1997, Section 22).

Visible brightness percepts are not represented within the monocular FIDOs. Model V2 
representations of binocular boundaries and monocular filled-in surfaces are predicted to 
be amodal, or perceptually invisible. These representations are predicted to directly
activate object recognition (i.e., Thinking!) mechanisms in inferotemporal cortex and
beyond, since they accurately represent occluding and occluded objects. In particular,
boundary pruning enables boundaries of occluded objects to be completed within the 
BCS, which makes them easier to recognize, as is illustrated for the Bregman-Kanizsa 
display in Figure 8. The monocular FIDO surface representations fill-in an occluded ob- 
ject within these completed object boundaries, even behind an opaque occluding object. 
We can hereby know the color of occluded regions without seeing them. How, then, do 
we see opaque occluding surfaces? How does the visual cortex generate representations 
of occluding and occluded objects that can be easily recognized, yet also allow us to con- 
sciously see, and reach for, only the unoccluded parts of objects? 

FACADE theory proposes that the latter goal is realized at the binocular FIDOs, which 
process a different combination of boundary and surface representations than is found at 
the monocular FIDOs. The surface representations at the monocular FIDOs are depth- 
selective, but they do not combine brightness and color signals from both eyes. Binocu- 
lar combination of brightness and color signals takes place at the binocular FIDOs, 
which are predicted to exist in cortical area V4. It is here that modal, or visible, surface 
representations occur, and we see only unoccluded parts of occluded objects, except 
when transparent percepts are generated by special circumstances. 

To accomplish binocular surface matching, monocular FCS signals from both eyes 
(pathways 8 in Figure 5) are binocularly matched at the binocular FIDOs. These 
matched signals are redundantly represented on multiple FIDOs. The redundant binoc- 
ular signals are pruned by inhibitory contrast-sensitive signals from the monocular FI- 
DOs (pathways 9 in Figure 5). As in the case of boundary pruning, these surface prun- 
ing signals arise from surface regions that successfully fill-in within the monocular FI- 
DOs. These signals inhibit the FCS signals at their own positions and farther depths. As 
a result, occluding objects cannot redundantly fill-in surface representations at multiple
depths. Surface pruning is another example of the asymmetry between near and far. 
Figure 9 illustrates how surface pruning works for the Bregman-Kanizsa image. Figure 
9a shows the signals that initiate filling-in at the near Binocular FIDO, and Figure 9b 
shows them at the far Binocular FIDO. Surface pruning eliminates signals from the oc- 
cluder in Figure 9b. 




Figure 9. Enriched boundary and modal surface representations. Binocular FIDO filling-in signals at (a) near 
depth and (b) far depth. Enriched boundaries at the (c) near depth and (d) far depth. Filled-in Binocular FIDO 
activity consisting of two modal surfaces at two different depths: (e) near depth and (f) far depth. [Reprinted 
with permission from Kelly and Grossberg (2000).] 

As in the monocular FIDOs, FCS signals to the binocular FIDOs can initiate filling-in
only where they are spatially coincident and orientationally aligned with BCS bound- 
aries. BCS-to-FCS pathways 10 in Figure 5 carry out depth-selective surface capture of 
the binocularly matched FCS signals that survive surface pruning. In all, binocular FI- 
DOs fill in FCS signals that: (a) survive within-depth binocular FCS matching and 
across-depth FCS inhibition; (b) are spatially coincident and orientationally aligned with 
BCS boundaries; and (c) are surrounded by a connected boundary (web). 

One further property completes this summary: At the binocular FIDOs, nearer bound- 
aries are added to FIDOs that represent their own and farther depths. This asymmetry be- 
tween near and far is called boundary enrichment. Enriched boundaries prevent occlud- 
ing objects from looking transparent by blocking filling-in of occluded objects behind 
them. The total filled-in surface representation across all binocular FIDOs represents the 
visible percept. It is called a FACADE representation because it multiplexes the proper- 
ties of Form-And-Color-And-DEpth that give FACADE theory its name. Figures 9c and 
9d show the enriched near and far Binocular FIDO boundaries, respectively, for the 
Bregman-Kanizsa image. Note the superposition of occluder and occluded boundaries
in Figure 9d. Figures 9e and 9f show the filled-in near and far modal surface represen- 
tations that the surface signals in Figure 9a and 9b cause within the boundaries of Fig- 
ures 9c and 9d. Note that only the unoccluded surface of the B is “visible” in the Binoc- 
ular FIDO representation, even though the entire B surface is completed within the 
amodal Monocular FIDO representation in Figure 8d. 
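
Boundary enrichment can be sketched as a cumulative combination of boundaries from near to far. The use of a pointwise maximum and the list-per-depth layout are assumptions chosen for illustration.

```python
def enrich_boundaries(pruned):
    """Add nearer boundaries to every farther depth.

    pruned[d][x]: boundary strength at depth d (0 = nearest) after pruning.
    Returns one enriched copy per depth; copy d contains the boundaries of
    depth d together with those of all nearer depths.
    """
    enriched = []
    running = [0.0] * len(pruned[0])
    for copy in pruned:  # iterate from nearest to farthest
        running = [max(r, b) for r, b in zip(running, copy)]
        enriched.append(running)
    return enriched


pruned = [
    [0.0, 1.0, 0.0, 1.0, 0.0, 0.0],   # near: occluder edges only
    [0.0, 0.0, 0.0, 0.0, 1.0, 1.0],   # far: completed edges of the occluded shape
]
enriched_near, enriched_far = enrich_boundaries(pruned)
```

The near copy is unchanged, whereas the far copy now also contains the occluder edges, so filling-in at the far depth cannot spread into the region covered by the occluder.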



KANIZSA STRATIFICATION 

Kanizsa Stratification images (Kanizsa, 1985) can also lead to depthful figure-ground 
percepts (e.g., Figure 10). Here the percept is one of a square weaving over and under 
the cross. This image is interesting because a single globally unambiguous figure-ground 
percept of one object being in front (cross or thin outline square) does not occur. On the 
left and right arms of the cross in Figure 10, the contrastive vertical black lines are cues 
that the outline square is in front of the cross arms. The top and bottom regions consist 
of a homogeneously white figural area, but most observers perceive two figures, the 
cross arms in front of the thinner outline square. This is usually attributed to the fact that 
a thinner structure tends to be perceived behind a thicker one most of the time (Petter, 
1956; Tommasi et al., 1995). The figure-ground stratification percept is bistable through 
time, flipping intermittently between alternative cross-in-front and square-in-front per- 
cepts. Kanizsa used this sort of percept to argue against the Helmholtz “unconscious in- 
ference” account which would not expect interleaving to occur, due to its low probabil- 
ity during normal perceptual experience. FACADE theory uses the same mechanisms as 
above to explain how perceptual stratification of a homogeneously-colored region oc- 
curs, and how the visual system knows which depth to assign the surface color in dif- 
ferent parts of the display. Many other percepts have also been explained by using the 
same small set of concepts and mechanisms. 

Figure 10. An example of perceptual stratification. [Reprinted with permission from Kanizsa (1985).]



An outline of the FACADE explanation is as follows. The thin vertical black lines cre- 
ate T-junctions with the cross. The stems of the T boundaries are broken by bipole feed- 
back, thus separating the thin outline square from the cross (see Figure 11a). At the top 
and bottom arms of the cross, vertical bipole cells link the sections of the cross arms to- 
gether, thereby creating a T-junction with the sections of the square. The vertical bipole 
cells of the cross win out over the horizontal bipole cells of the squares. This happens 
because the cross is wider than the square. Thus vertical bipole cells have more support 
from their receptive fields than do the horizontal bipole cells at the cross-square inter- 
section. The boundaries of the square are hereby inhibited, thereby creating end gaps. As 
a result, the cross arms pop in front and the square is seen behind the cross (Figures 11b
and 11c).




Figure 11. (a) Near-depth boundaries in response to the Kanizsa stratification image. Binocular filling-in do- 
main activity at the (b) near depth and (c) far depth. 
The bistability of the stratification percept may be explained in the same way that the
bistability of the Weisstein effect (Brown and Weisstein, 1988) was explained in Grossberg
(1994). This explanation used the habituative transmitters that occur in the pathways 3
between complex cells and hypercomplex cells (Figure 5). Transmitter habituation
helps to adapt active pathways and thereby to reset boundary groupings when their
inputs shut off. This transmitter mechanism has been used to simulate psychophysical
data about visual persistence, aftereffects, residual traces, and metacontrast masking
(Francis, 1997; Francis and Grossberg, 1996a, 1996b; Francis, Grossberg, and Mingolla,
1994), developmental data about the self-organization of opponent simple cells, complex
cells, and orientation and ocular dominance columns within cortical area V1
(Grunewald and Grossberg, 1998; Olson and Grossberg, 1998), and neurophysiological
data about area V1 cells (Abbott, Varela, Sen, and Nelson, 1997). The bistability of the
stratification percept can hereby be traced to more basic functional requirements of visual
cortex.



CONCLUSION 

The present chapter describes how the FACADE theory of 3-D vision and figure- 
ground perception helps to explain some of the most widely known differences between 
seeing and thinking. Along the way, the theory provides explanations of the percepts that 
are generated by some of Kanizsa’s most famous displays. These explanations gain interest
from the fact that they reflect fundamental organizational principles of how the
brain sees. In particular, they illustrate some of the complementary properties of bound- 
ary and surface computations in the interblob and blob cortical processing streams of vi- 
sual cortex. 

These insights lead to a revision of classical views about how visual cortex works. In 
particular, visual cortex does not consist of independent processing modules. Rather, hi- 
erarchical and parallel interactions between the boundary and surface streams synthesize 
consistent visual percepts from their complementary strengths and weaknesses. Bound- 
aries help to trigger depth-selective surface filling-in, and successfully filled-in surfaces 
reorganize the global patterning of boundary and surface signals via feedforward and 
feedback signals. Boundary-gated filling-in plays a key role in surface perception, rang- 
ing from lower-level uses, such as recovering surface brightness and color after dis- 
counting the illuminant and filling-in the blind spot, to higher-level uses, such as com- 
pleting depthful modal and amodal surface representations during 3-D vision and figure- 
ground separation. 

Boundary and surface representations activate learned object representations which, in 
turn, prime them via top-down modulatory attentional signals. This priming property 
emphasizes that the visual cortex is not merely a feedforward filter that passively detects 
visual features, as was proposed by many scientists who thought of the visual brain as a 
Fourier filter or as a feedforward hierarchy of bottom-up connections that form increas- 
ingly complex and large-scale receptive fields. Rather, the visual brain is an integrated 
system of bottom-up, top-down, and horizontal interactions which actively completes
boundary groupings and fills-in surface representations as its emergent perceptual units. 
This interactive perspective has enabled recent neural models to quantitatively simulate 
the dynamics of individual cortical cells in laminar cortical circuits and the visual per- 
cepts that emerge from their circuit interactions. Such results represent a concrete pro- 
posal for beginning to solve the classical Mind/Body Problem, and begin to do justice to 
the exquisite sensitivity of our visual percepts to the scenes and images through which 
we know the visual world. 

Prof. Stephen Grossberg 

Department of Cognitive and Neural Systems 

Boston University 

677 Beacon Street 

Boston, MA 02215, U.S.A. 
FUNCTIONAL ARCHITECTURE OF THE VISUAL
CORTEX AND VARIATIONAL MODELS FOR KANIZSA’S MODAL 
SUBJECTIVE CONTOURS 



INTRODUCTION 

We will present a neuro-geometrical model for generating the shape of Kanizsa’s
modal subjective contours, which we will call K-contours in the following. It will be
based on the functional architecture of the primary areas of the visual cortex.

As we are interested in a mathematical clarification of some very basic phenomena, we
will restrict ourselves to a very small part of the problem, involving only the functional
architecture of the first cortical area V1. We will see that the model is already quite
sophisticated. Many other aspects (e.g., the role of V2) would of course have to be taken
into account in a more complete model.



I. TOWARDS VARIATIONAL MODELS OF KANIZSA’S ILLUSORY CONTOURS 

The object under study will not be classical straight K-contours but curved ones (K-curves)
where the sides of the internal angles of the pacmen are not aligned (see figure 1).




Figure 1. An example of a Kanizsa curved illusory modal contour. The second figure shows the well known 
“neon effect” (diffusion of color inside the area bounded by the virtual contours). 



In an important paper, Shimon Ullman (1976) of the MIT AI Lab introduced the key 
idea of variational models. 

“A network with the local property of trying to keep the contours ‘as straight as possi- 
ble’ can produce curves possessing the global property of minimizing total curvature.” 
He was followed by Horn (1983), who introduced a particular type of curve, the curves
of least energy. Then in 1992, David Mumford introduced in computer vision, but only
for amodal contours, a fundamental model based on the physical concept of elastica.
Elastica are well known in classical mechanics. They are curves minimizing at the same
time the length and the integral of the square of the curvature κ, i.e. the energy

$E = \int (\alpha\kappa + \beta)^2 \, ds$

where ds is the element of arc length along the curve.

We will present here a slightly different variational model, based on the concept of “geodesic
curves” in V1 and more realistic at the neural level. Let us begin with some experimental
results.



II. AN EXPERIMENT ON K-CURVES (WITH JACQUES NINIO)

With our colleague Jacques Ninio of the École Normale Supérieure (Paris) we worked
out an experiment aimed at measuring the exact position of the extremum of a K-curve.
For that purpose, we looked at families of K-curves generated by






Figure 2. Curved Kanizsa triangles and squares used for the experiment with J. Ninio. 
2 configurations: triangle or square;

2 sizes of configuration; 

2 sizes of pacmen; 

4 orientations; 

5 angles (see figure 2). 
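
As a rough illustration of the size of this stimulus set, the following sketch enumerates the conditions under the assumption (ours, not stated in the text) that the five factors are fully crossed. The level labels are hypothetical placeholders; only the number of levels per factor comes from the list above.

```python
from itertools import product

# Number of levels per factor as listed in the text. The labels themselves
# are illustrative placeholders, not values reported in the chapter.
factors = {
    "configuration": ["triangle", "square"],
    "configuration_size": ["small", "large"],
    "pacman_size": ["small", "large"],
    "orientation": [0, 1, 2, 3],
    "angle": [0, 1, 2, 3, 4],  # aperture angles #0 to #4
}

# Assumption: a fully crossed design, one condition per combination of levels.
conditions = [dict(zip(factors, levels)) for levels in product(*factors.values())]

print(len(conditions))  # 2 * 2 * 2 * 4 * 5 combinations
```

If the actual experiment used only a subset of these combinations, the list would simply be filtered accordingly.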

There are different methods for measuring the extremum of a K-contour. For instance,
one can use the “subthreshold summation” method: the threshold for the detection of a
small segment parallel to the K-contour decreases when the segment is exactly located
on the K-contour (see figure 3).
Figure 3. The subthreshold summation method: the threshold for the detection of a small segment parallel to
the K-contour decreases when the segment is exactly located on the K-contour (from Dresp and Bonnet, 1995).



As for us, we used another method for detecting the extremal point of a K-curve: the subject
was asked to place a marker (the extremity of an orthogonal line, a small segment, the
symmetry axis of a small stripe) as exactly as possible at the extremum (see figure 4).

For different cases (triangle / square and small / large pacmen size) we compare three 
positions: 

1. the piecewise linear position (intersection of the corresponding sides of the two pacmen);

2. the position chosen by the subjects; 

3. the circle position (extremum of the arc of circle tangent to the sides of the two pac- 
men). 
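
To make the two reference positions concrete, here is a minimal sketch for a simplified symmetric configuration that we assume for illustration only: the two end points of the contour are placed at (-1, 0) and (1, 0), and the boundary tangents make angles +θ and -θ with the chord. The function names are hypothetical, and the heights are measured from the chord rather than from the center of the configuration as in the experiment.

```python
import math

def piecewise_linear_apex(theta):
    """Height of the intersection of the two tangent lines above the chord."""
    # The line through (-1, 0) with slope tan(theta) reaches x = 0 at height tan(theta).
    return math.tan(theta)

def circle_apex(theta):
    """Height of the circular arc through (-1, 0) and (1, 0) tangent to both lines."""
    # The center lies on the y-axis at (0, -c); tangency at (-1, 0) requires the
    # radius vector (-1, c) to be orthogonal to the tangent direction (1, tan(theta)).
    c = 1.0 / math.tan(theta)
    radius = math.sqrt(1.0 + c * c)
    return radius - c  # equals tan(theta / 2)

for degrees in (10, 20, 30, 40):
    theta = math.radians(degrees)
    print(degrees, round(piecewise_linear_apex(theta), 4), round(circle_apex(theta), 4))
```

For every θ strictly between 0 and π/2 the circle extremum lies below the piecewise linear one, so the two positions bracket an interval in which a measured extremum can be located.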

Let us take for instance the case (see figure 5) of the square with small pacmen (parameter
ps = pacmen size = 1). The graph plots the distance d of the extremum of the
K-contour to the center of the configuration as a function of the aperture angle (figure
5b). d is measured by its ratio to the piecewise rectilinear case (which corresponds to
d = 1) (figure 5a). Five aperture angles are considered:

#2 corresponds to the classical case of a straight K-contour (d₂ = 1);

#1 to a slightly convex one (d₁ > d₂ = 1);

#0 to a more convex one (d₀ > d₁ > d₂ = 1);
Figure 4. The method of detection of the extremal point of a curved K-contour. The subject is asked to place a
marker (the extremity of an orthogonal line, a small segment, the symmetry axis of a small stripe) as exactly
as possible at the extremum.

Figure 5. (a) The case of the square with small pacmen (parameter ps = pacmen size = 1). The distance d is the
distance of the extremum of the K-contour to the center of the configuration; d is measured by its ratio to the
piecewise rectilinear case (which corresponds therefore to d = 1). (b) Comparison of three K-contours: the
piecewise rectilinear one, the one chosen by the subjects, the circle one. The graph plots the distance d as a
function of the aperture angle.
#3 to a slightly concave one (d₃ < d₂ = 1);

#4 to a more concave one (d₄ < d₃ < d₂ = 1).

We see that the observed empirical K-contour is located between the piecewise rectilinear
one and the circular one, and that the circular model is therefore ruled out.

Another important result is that the deflections for the triangle and for the square are 
not the same. This is a typical global effect (see figure 6) which is very interesting but 
won’t be taken into account here. 



Distance 




Figure 6. The deflections for the triangle and for the square are not the same. 



III. NEURAL FUNCTIONAL ARCHITECTURE

We now want to work out a neurally plausible model of K-curves at the V1 level. We
need first some results concerning the functional architecture of V1.

It is well known (see for instance De Angelis et al. 1995) that the receptive profiles (the
transfer functions) of simple cells of V1 are like third-order derivatives of Gaussians
(with a well described underlying neural circuitry, from the lateral geniculate body to the
layers of V1) (see figure 7).

Such receptive profiles act on the signal as filters by convolution and perform a wavelet
analysis.
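
The filtering step can be sketched in one dimension as follows. This is only an illustration under our own assumptions: a unit-width Gaussian, a coarse sampling step, and hypothetical function names; real receptive profiles are two-dimensional and oriented.

```python
import math

def gaussian_third_derivative(x, sigma=1.0):
    """Third derivative of exp(-x^2 / (2 sigma^2)) with respect to x."""
    g = math.exp(-x * x / (2.0 * sigma ** 2))
    return (3.0 * x / sigma ** 4 - x ** 3 / sigma ** 6) * g

def sample_kernel(half_width=8, step=0.5, sigma=1.0):
    """Sample the profile on a symmetric grid (illustrative sampling choice)."""
    return [gaussian_third_derivative(i * step, sigma)
            for i in range(-half_width, half_width + 1)]

def filter_response(signal, kernel):
    """Discrete convolution of signal with kernel, keeping only full overlaps."""
    n, m = len(signal), len(kernel)
    return [sum(signal[i + j] * kernel[m - 1 - j] for j in range(m))
            for i in range(n - m + 1)]

kernel = sample_kernel()
flat = [1.0] * 40                   # uniform luminance
edge = [0.0] * 20 + [1.0] * 20      # a luminance step
flat_response = filter_response(flat, kernel)
edge_response = filter_response(edge, kernel)
```

Because the sampled profile is odd, the response to the uniform signal vanishes up to rounding, while the response to the step is concentrated around the position of the edge.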

Due to their structure, the receptive fields of simple cells detect a preferential orientation.
Simplifying the situation, we can say they detect pairs (a, p) of a spatial (retinal)
position a and a local orientation p at a. They are organized in small modules called hypercolumns
(Hubel and Wiesel) associating retinotopically to each position a of the retina
R a full exemplar P_a of the space of orientations p at a. A very simplified schema of
Figure 7. (a) Level curves of the receptive profile of a simple cell of V1 (right) and a simple schema of the receptive
field (left) (from De Angelis et al. 1995). (b) A third derivative of a Gaussian along a direction. (c) The
level curves. They fit very well with the empirical data (a).



this structure (with a 1-dimensional base R) is shown in figure 8. It is called a fibration
with base R and fiber P.









Figure 8. The general schema of a fibration with base space R, fiber P and total space V. The projection π projects
V onto R and above every point a of R the fiber P_a is isomorphic to P.



Pairs (a, p) are called contact elements in geometry. But, beyond retinotopy formalized
by the projection π, their set V = {(a, p)} needs to be strongly structured to allow the visual
cortex to compute contour integration. We meet here the problem of the functional
architecture of V1.
Recent experiments have shown that hypercolumns are geometrically organized in
pinwheels. The cortical layer is reticulated by a network of singular points which are the
centers of the pinwheels; around these singular points all the orientations are distributed
along the rays of a “wheel”, and the wheels are glued together in a global structure. The
experimental method is that of in vivo optical imaging based on activity-dependent intrinsic
signals (Bonhoeffer & Grinvald, 1991), which makes it possible to acquire images of the
activity of the superficial cortical layers. Gratings with high contrast and different (e.g., 8)
orientations are presented many times (20-80) with, e.g., a width of 6.25° for the dark
strips and of 1.25° for the light ones, and a velocity of 22.5°/s. A window is then opened
above V1 and the cortex is illuminated with an orange light. One sums the images of
V1’s activity for the different gratings, constructs differential maps, and eliminates the
low frequency noise. The maps are then normalized (by dividing the deviation relative
to the mean value at each pixel by the global mean deviation) and the orientations are
coded by colors (iso-orientation lines become therefore iso-chromatic lines).

In the celebrated figure 9 due to William Bosking, one can identify three classes of 
points: 

1. regular points, where the orientation field is locally trivial;

2. singular points at the center of the pinwheels, where a full set of iso-orientation lines 
converge, two adjacent singular points being of opposed chiralities; 

3. saddle-points at the center of the cells defined by the singular points. 




Figure 9. The pinwheel structure of V1 for a tree shrew. The different orientations are coded by colors. Examples
of regular points and singularities of opposed chiralities are zoomed in. (From Bosking et al. 1997).



A crystal-like model of such a network of pinwheels is shown in figure 10.

As we have already noticed, the functional architecture associating retinotopically to
each position a of the retina R an exemplar P_a of the space of orientations implements
a well known geometrical structure, namely the fibration π : R × P → R with base R and
fiber P. But such a “vertical” structure idealizing the retinotopic mapping between R and
P is definitely insufficient. To implement a global coherence, the system must be able to
Figure 10. An idealized “crystal-like” model of pinwheels centered on a regular lattice of singular points. Some
iso-orientation lines are represented. The saddle points in the centers of the domains are clearly visible.

compare two retinotopically neighboring fibers P_a and P_b over two neighboring
retinal points a and b. This is a problem of parallel transport whose simplified
schema is shown in figure 11 (to be compared with figure 8). It has been solved experimentally
by the discovery of “horizontal” cortico-cortical connections.




Figure 11. Cortico-cortical horizontal connections allow the system to compare orientations in two different hy- 
percolumns corresponding to two different retinal positions a and b. 



Experiments show that cortico-cortical connections connect neurons of the same orientation
in neighboring hypercolumns. This means that the system is able to know, for
b near a, if the orientation p at a is the same as the orientation q at b. The retino-geniculo-cortical
“vertical” connections give an internal meaning to relations between
contact elements (a, p) and (a, q) (different orientations p and q at the same point a)
while the “horizontal” cortico-cortical connections give an internal meaning to relations
between contact elements (a, p) and (b, p) (same orientation p at different points
a and b) (see figure 12).
[Figure 12 panel labels. Vertical connections: a = b, p ≠ q. Horizontal connections: a ≠ b, p = q.]



Figure 12. While the retino-geniculo-cortical “vertical” connections give a meaning to the relations between pairs
(a, p) and (a, q) (different orientations p and q at the same point a), the “horizontal” cortico-cortical connections
give a meaning to the relations between pairs (a, p) and (b, p) (same orientation p at different points a and b).



Moreover, as is schematized in figure 13, cortico-cortical connections connect not only
parallel but also coaxial orientation cells, that is, neurons coding contact elements (a, p)
and (b, p) such that p is the orientation of the axis ab.




[Figure 13 panel label. Alignment: a ≠ b, p = q = ab.]



Figure 13. Coaxial orientation cells. 



As William Bosking (1997) emphasizes:

“The system of long-range horizontal connections can be summarized as preferentially
linking neurons with co-oriented, co-axially aligned receptive fields”.

We will now show that these results mean that what geometers call the contact structure
of the fibration π : R × P → R is neurologically implemented.



IV. THE CONTACT STRUCTURE OF V1 AND THE ASSOCIATION FIELD

We work in the fibration π : V = R × P → R with base space R and fiber P = set of orientations
p. V is an idealized model of the functional architecture of V1. Mathematically,
π can be interpreted as the fibration R × P¹ (P = P¹ = projective line of orientations), or as
the fibration R × S¹ (P = S¹ = unit circle of the orientation angles θ), or as the space R × ℝ
of 1-jets of curves C in R (P = ℝ = the real line of the tangents p = tan(θ) of the orientation
angles). In the following, we will use the latter model. A coordinate system for V is
therefore given by triplets (x, y, p) where p = tan(θ).

If C is a regular curve in R (a contour), it can be lifted to V through the map
Γ : C → V = R × P which associates to every point a of C the contact element (a, p_a) where
p_a is the tangent of C at a. Γ represents C as the envelope of its tangents (see figure 14).




Figure 14. The lifting of a curve y = f(x) in the base space R to the space V. Above every point (x, y = f(x))
of the curve we take the tangent direction p = f'(x).



If a(s) is a parametrization of C, we have p_a = a'(s) (where a' symbolizes the derivative
y'(s)/x'(s)) and therefore Γ = (a(s), p(s)) = (a(s), a'(s)). If y = f(x) is a (local) equation
of C, then a (local) equation of the lifting Γ in V is (x, y, p) = (x, y, y').

To every curve C in R is associated a curve Γ in V. But the converse is definitely false.
Indeed, let Γ = (a(s), p(s)) be a (parametrized) curve in V. The projection a(s) of Γ is of
course a curve C in R. But in general Γ will not be the lifting of its projection C. Γ will
be the lifting of C = π(Γ) iff p(s) = a'(s). In differential geometry, this condition is called
a Frobenius integrability condition. Technically, it says that to be a coherent curve in V,
Γ must be an integral curve of the contact structure of the fibration π. We show in figure
15, besides the integrable example of figure 14, three examples of non-integrable curves
Γ which are not the lifting of their projection C.

Geometrically, the integrability condition means the following. Let t = (x, y, p; 1, y', p')
be a tangent vector to V at the point (a, p) = (x, y, p). If p = y' we get t = (x, y, p; 1, p, p').
It is easy to show that this condition means exactly that t is in the kernel of the 1-form
ω = dy - p dx. This kernel is a plane, called the contact plane of V at (a, p), and the integrability
condition for a curve Γ in V says exactly that Γ is tangent at each of its points
(a, p) to the contact plane at that point. It is in that sense that Γ is an integral curve of
the contact structure of V.
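
The condition can be checked numerically. The sketch below evaluates the 1-form dy - p dx on the tangent vector (1, y'(x), p'(x)) of a curve x → (x, y(x), p(x)) in V, for one lifted curve and for three non-integrable variants loosely modeled on the cases of figure 15. The test function f, the offsets, and all names are our own illustrative choices.

```python
import math

def contact_defect(y, p, x, h=1e-5):
    """Value of dy - p dx on the tangent (1, y'(x), p'(x)); zero iff p = y' at x."""
    y_prime = (y(x + h) - y(x - h)) / (2.0 * h)  # central difference estimate
    return y_prime - p(x)

def f(x):
    return 0.5 * math.sin(x)

def f_prime(x):
    return 0.5 * math.cos(x)

# Candidate curves (x, f(x), p(x)) in V above the same base curve y = f(x).
candidates = {
    "a": f_prime,                                            # true lifting
    "b": lambda x: f_prime(x) + 0.3,                         # constant offset
    "c": lambda x: 0.1,                                      # constant p
    "d": lambda x: math.tan(2.0 * math.atan(f_prime(x))),    # angle rotating twice as fast
}

xs = [i / 10.0 for i in range(-10, 11)]
max_defect = {name: max(abs(contact_defect(f, p, x)) for x in xs)
              for name, p in candidates.items()}
print(max_defect)
```

Only the true lifting stays in the contact planes (defect zero up to discretization error); the three other curves project onto the same contour but leave the contact planes.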
Figure 15. The association field as a condition of integrability. (a) The integrability condition is satisfied. (b),
(c), (d) The condition is not satisfied. In (b) we add a constant angle to the tangent (i.e. p = f'(x) + p_0). In (c) p
is constant while f' is not. In (d) p rotates faster than f'.



The integrability condition is a version of the Gestalt principle of “good continuation”.
Its psychophysical counterpart has been experimentally analyzed by David
Field, Anthony Hayes and Robert Hess (1993) and explained using the concept of association
field. Let (a_i, p_i) be a sequence of contact elements embedded in a background
of distractors. The authors show that they generate a perceptually salient
curve (pop-out) iff the p_i are tangent to the curve interpolating the a_i. This is due to
the fact that the activation of a simple cell detecting a contact element (a, p) pre-activates,
via the horizontal cortico-cortical connections, cells detecting contact elements
(c, q) with c roughly aligned with a in the direction p and q close to p. The pre-activation
is strongly enhanced if the cell (c, q) is sandwiched between a cell (a, p)
and a cell (b, p) (see figure 16).




Figure 16. Preactivation of a cell (c, p).
The pop-out of the curve generated by the (a_i, p_i) is a typical Gestalt phenomenon
which results from a binding induced by the co-activation. It manifests a very particular
type of grouping. As was emphasized by Field, Hayes, and Hess (1993):

“Elements are associated according to joint constraints of position and orientation.” (p. 
187) 

“The orientation of the elements is locked to the orientation of the path; a smooth curve 
passing through the long axis can be drawn between any two successive elements.” (p. 
181) 

This is clearly a discrete version of the integrability condition. 
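
A discrete check in the spirit of this remark can be sketched as follows. The criterion used here (comparing the orientation of each element with the chord joining its two neighbours) is our own simplification, not the procedure of Field, Hayes and Hess, and all names are hypothetical.

```python
import math

def orientation_difference(alpha, beta):
    """Smallest angle between two orientations, which are defined modulo pi."""
    d = (alpha - beta) % math.pi
    return min(d, math.pi - d)

def alignment_defects(elements):
    """For each interior element (x, y, theta), angle between theta and the
    chord joining the positions of its two neighbours in the sequence."""
    defects = []
    for (x0, y0, _), (_, _, theta), (x2, y2, _) in zip(elements, elements[1:], elements[2:]):
        chord = math.atan2(y2 - y0, x2 - x0)
        defects.append(orientation_difference(theta, chord))
    return defects

# Elements sampled on a circular arc, oriented along the tangent of the arc.
ts = [0.2 * i for i in range(8)]
aligned = [(math.cos(t), math.sin(t), t + math.pi / 2.0) for t in ts]
# Same positions, but every orientation rotated by a fixed angle of 0.5 rad.
rotated = [(x, y, theta + 0.5) for (x, y, theta) in aligned]

print(max(alignment_defects(aligned)), min(alignment_defects(rotated)))
```

On an evenly sampled circular arc the chord between the two neighbours is parallel to the tangent, so the defects vanish for the aligned sequence and equal the imposed rotation for the other one.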



V. A VARIATIONAL MODEL OF MODAL KANIZSA ILLUSORY CONTOURS 

In such a framework, we can solve some aspects of the Kanizsa problem in a principled
way. Two pacmen of respective centers a and b with a specific aperture angle define
two contact elements A = (a, p) and B = (b, q) of V. A K-curve interpolating between
A and B is

1. a curve C from a to b in R with tangent p at a and tangent q at b; 

2. a curve minimizing a sort of “energy” (variational problem). 

But since the integration of C is processed in V, we must lift the problem to V. We
must therefore find in V a curve Γ interpolating between (a, p) and (b, q), and which
is at the same time:

1. “as straight as possible”, that is “geodesic” in V;

2. an integral curve of the contact structure.

In general Γ will not be a straight line because it will have to satisfy the integrability
condition. It will be “geodesic” only in the class of integral curves.

Mathematically, the problem is not trivial. We have to solve constrained Euler-Lagrange
equations in the jet space V. We must first define appropriate Lagrangians on V
based on Riemannian metrics which reflect the weakening of the horizontal cortico-cortical
connections when the discrepancy between the boundary values θ_a and θ_b increases.
If the angle θ is measured relatively to the axis AB (θ has therefore an intrinsic geometric
meaning), the weakening must vanish for θ = 0 and θ = π and diverge for θ = π/2.
The function p = f' = tan θ being the simplest function sharing these properties, it seems
justified to test first the Euclidean metric of V. We will use a frame Oxy of R where the
x-axis is identified with AB. The invariance under a change of frame is expressed by the
action of the Euclidean group E(2) on V.

We look therefore for curves of minimal length in V among those which lift curves in
R, that is, which satisfy the Frobenius integrability condition and are integrals of the contact
structure. We will call them “Legendrian geodesics”. Let (x, y, p; ξ, η, π) be coordinates
in the tangent space TV of V. We have to minimize the length of Γ expressed by
the functional ∫ ds where the element of arc length ds is given by ds² = dx² + dy² + dp².
The energy to minimize is therefore $E = \int_{x_a}^{x_b} L(x)\,dx$ where the Lagrangian L is given, for
a curve Γ of the form (x, y = f(x), p = f'(x)), by the formula L(x) dx = ds, that is by:

$L(x) = \sqrt{1 + f'(x)^2 + f''(x)^2}$
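
As a minimal numerical sketch of this length functional, the following code integrates the square root of 1 + f'(x)² + f''(x)² by Simpson's rule for two simple curves. The function name, the test curves and the interval are illustrative assumptions of ours, and the sketch does not attempt to reproduce any value reported in the chapter.

```python
import math

def legendrian_length(f_prime, f_second, a, b, n=1000):
    """Length in V of the lifting of y = f(x) over [a, b] (composite Simpson rule)."""
    if n % 2:
        n += 1  # Simpson's rule needs an even number of sub-intervals

    def lagrangian(x):
        return math.sqrt(1.0 + f_prime(x) ** 2 + f_second(x) ** 2)

    h = (b - a) / n
    total = lagrangian(a) + lagrangian(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * lagrangian(a + i * h)
    return total * h / 3.0

# A horizontal straight line: f' = 0 and f'' = 0, so the length is b - a.
straight = legendrian_length(lambda x: 0.0, lambda x: 0.0, -1.0, 1.0)
# The parabola f(x) = x^2 / 2: f' = x and f'' = 1.
parabola = legendrian_length(lambda x: x, lambda x: 1.0, -1.0, 1.0)
print(straight, parabola)
```

The curvature term f''(x)² is what makes a bent contour longer in V than its length in the base plane alone would suggest.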

We have to solve the Euler-Lagrange (E-L) equations constrained by the integrability
constraint p = f'(x), i.e. Σ = 0 with Σ = p - η. These constrained E-L equations are:

$\left(\frac{\partial}{\partial y} - \frac{d}{dx}\frac{\partial}{\partial \eta}\right)(L + \lambda\Sigma) = 0$

$\left(\frac{\partial}{\partial p} - \frac{d}{dx}\frac{\partial}{\partial \pi}\right)(L + \lambda\Sigma) = 0$

where λ(x) is a function, called a Lagrange multiplier. The idea is that the E-L equations
with the constraint Σ = 0 are the same as the unconstrained E-L equations for the modified
Lagrangian L + λΣ.

After some tedious computations, we get the following differential equation for the
function p = f', where c and d are two integration constants:

$1 + p(x)^2 = (c\,p(x) + d)\sqrt{1 + p(x)^2 + p'(x)^2}$
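
One possible route to this equation can be sketched as follows. It is our own reconstruction, not necessarily the computation the author has in mind. We write Σ = p - η for the constraint, λ(x) for the Lagrange multiplier, η = y' and π = p' for the derivatives along the curve, and L for the square root of 1 + η² + π²; the constants c and d are defined by the first integrals below and may differ from the author's by a convention.

```latex
\begin{aligned}
&\text{E-L in } y:\quad
  \frac{d}{dx}\!\left(\frac{\eta}{L}-\lambda\right)=0
  \;\Longrightarrow\; \lambda=\frac{\eta}{L}-c, \\[4pt]
&\text{First integral (valid on } \Sigma=0,\ \text{where }
  \partial_x(L+\lambda\Sigma)=\lambda'\Sigma=0):\\
&\qquad (L+\lambda\Sigma)
  -\eta\,\frac{\partial(L+\lambda\Sigma)}{\partial\eta}
  -\pi\,\frac{\partial(L+\lambda\Sigma)}{\partial\pi}=d, \\[4pt]
&\text{i.e.}\quad
  L-\eta\!\left(\frac{\eta}{L}-\lambda\right)-\frac{\pi^{2}}{L}
  =\frac{L^{2}-\eta^{2}-\pi^{2}}{L}+\lambda\eta
  =\frac{1}{L}+\lambda\eta=d, \\[4pt]
&\text{Using } \eta=p \text{ and } \lambda=\frac{p}{L}-c:\quad
  \frac{1+p^{2}}{L}-c\,p=d
  \;\Longrightarrow\;
  1+p^{2}=(c\,p+d)\sqrt{1+p^{2}+p'^{2}} .
\end{aligned}
```

The E-L equation in p, which reads λ = d/dx(π/L), is not used directly here; it enters only through the conservation law in the second step.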



As the solution is given by an elliptic integral, Legendrian geodesics are integrals of el- 
liptic functions. 

We can greatly simplify the solution of the equation when the function f is even, and
the curve Γ symmetric under the symmetry x → -x. Indeed, this condition implies immediately
c = 0, whence, putting k = 1/d, the simpler differential equation for p = f':

$(p')^2 = (1 + p^2)\,[k^2(1 + p^2) - 1]$

The parameter k is correlated to curvature: k² - 1 = κ(0)².

We get therefore:

$x = \mathrm{cst} + \int \frac{dt}{\sqrt{(1 + t^2)\,[k^2(1 + t^2) - 1]}}$

which is a well known elliptic integral of the first kind.
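
The integral can be evaluated numerically without special functions. The sketch below uses Simpson's rule for the integral from 0 to p of dt over the square root of (1 + t²)[k²(1 + t²) - 1]; the choice of integration constant (x = 0 where p = 0, as in the symmetric case), the restriction to k > 1, and the function names are assumptions made for illustration.

```python
import math

def integrand(t, k):
    """1 / sqrt((1 + t^2) * (k^2 * (1 + t^2) - 1)), finite for k > 1."""
    return 1.0 / math.sqrt((1.0 + t * t) * (k * k * (1.0 + t * t) - 1.0))

def x_of_p(p, k, n=2000):
    """Abscissa x at which the slope p is reached, taking x = 0 at p = 0."""
    if k <= 1.0:
        raise ValueError("this sketch assumes k > 1")
    if n % 2:
        n += 1  # Simpson's rule needs an even number of sub-intervals
    h = p / n
    total = integrand(0.0, k) + integrand(p, k)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * integrand(i * h, k)
    return total * h / 3.0

for slope in (0.25, 0.5, 1.0, 2.0):
    print(slope, x_of_p(slope, 1.5))
```

Near p = 0 the integrand is close to 1/sqrt(k² - 1), so x grows like p/sqrt(k² - 1); this agrees with the relation k² - 1 = κ(0)², since the curvature at the symmetry point is p'(0).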
We show in figure 17 how the solution f evolves when k varies from 1 to 1.65 by steps
of 0.05.

Figure 18 shows in the base space R and in the jet space V how the Legendrian geodesic
corresponding to k = 1.5 is situated relatively to the arc of circle, the arc of parabola
and the piecewise linear solution defined by the same boundary conditions.




Figure 17. Evolution of Legendrian geodesics when the boundary tangents become more and more vertical. (a)
Curves C in the base space R. (b) Curves Γ in the jet space V.




Figure 18. Position of the Legendrian geodesic (k = 1.5) relatively to the arc of circle, the arc of parabola and
the piecewise linear solution defined by the same boundary conditions, (a) In the base space R. (b) In the jet 
space V. 



The following table shows that the geodesic minimizes the length:

Curves              Length
Geodesic            7.02277
Arc of circle       7.04481
Arc of parabola     7.50298
Piecewise linear    12.9054
CONCLUSION

Due to the very strong geometrical structure of the functional architecture (hypercolumns,
pinwheels, horizontal connections), the neural implementation of Kanizsa’s
contours is deeply linked with sophisticated structures belonging to what is called contact
geometry and with variational models analogous to models already well known in
physics.

Prof. Jean Petitot 
CREA 

1 rue Descartes 
75005, Paris, France 
GESTALT THEORY AND COMPUTER VISION



1. INTRODUCTION 

The geometric Gestalt theory started in 1921 with Max Wertheimer's founding paper
[31]. In its last edition of 1975, the Gestalt Bible Gesetze des Sehens by Wolfgang Metzger
[22] gave a broad overview of the results of fifty years of research. At about the same
date, Computer Vision was an emerging new discipline, at the meeting point of Artificial
Intelligence and Robotics. The foundation of signal sampling theory by Claude
Shannon [28] was already twenty years old, but computers were able to deal with images
with some efficiency only at the beginning of the seventies. Two things are noticeable:

• Computer Vision did not use at first the Gestalt theory results: the founding book of 
David Marr [21] involves much more neurophysiology than phenomenology. Also, its pro- 
gramme and the robotics programme [12] founded their hopes on binocular stereo vision. 
This was in total contradiction to, or ignorance of the results explained at length in Met- 
zger's chapters on Tiefensehen. Indeed, these chapters demonstrate that binocular stereo vi- 
sion is a parent pauvre in human depth perception. 

• Conversely, Shannon’s information theory does not seem to have influenced Gestalt 
research, as far as we can judge from Kanizsa’s and Metzger’s books. The only bright 
exception is Attneave’s attempt [2] to give a shape sampling theory adapted to shape per- 
ception. 

This lack of initial interaction is surprising. Indeed, both disciplines have attempted to an- 
swer the following question: how to arrive at global percepts (let them be visual objects or 
gestalts 1 ) from the local, atomic information contained in an image? 

In this paper, we shall propose an analysis of the Wertheimer programme adapted to 
computational issues. We shall distinguish two kinds of laws: 

- the practical grouping laws (like vicinity or similarity), whose aim it is to build up 
partial gestalts, 

- the gestalt principles like masking or articulazione senza resti, whose aim it is to op- 
erate a synthesis between the partial groups obtained by the elementary grouping laws. 
We shall review some recent methods proposed by the authors of the present paper in the 
computation of partial gestalts (groups obtained by a single grouping law). These results 
show that 

• there is a simple computational principle (the so-called Helmholtz principle), inspired
by Kanizsa’s masking by texture, which allows one to compute any partial gestalt obtainable
by a grouping law (section 4). Also, a particular use of the articulazione senza
resti, which we call maximality, yields optimal partial gestalts;

• this computational principle can be applied to a fairly wide series of examples of par- 
tial gestalts, namely alignments, clusters, boundaries, grouping by orientation, size or 
grey level; 

• the experiments yield evidence that in natural world images, partial gestalts often col- 
laborate. 

We push one of the experiments to prove that the recursive building up of partial gestalts
can be led up to the third level (gestalts built by three successive grouping principles). In
contrast, we also show by numerical counterexamples that all partial gestalts are likely to
lead to wrong scene interpretations. As we shall see, the wrong detections are always explainable
by a conflict between gestalts. We eventually show some experiments suggesting
that Minimal Description Length principles [26] may be adequate to resolve some of
the conflicts between gestalt laws.

Our plan is as follows. We start in section 2 with an account of Gestalt theory, centered 
on the initial 1923 Wertheimer programme. In section 3, we focus on the problems raised 
by the synthesis of groups obtained by partial grouping laws: we address the conflicts 
between these laws and the masking phenomenon, which we discuss in the line of 
Kanizsa. In section 4, we point out several quantitative aspects implicit in Kanizsa's
definition of masking and show that one particular kind of masking, Kanizsa's masking by
texture, suggests computational procedures. Such computational procedures are
explained in section 5. We end this paper in section 6 with the discussion of a list of
numerical experiments on digital images.



2. GROUPING LAWS AND GESTALT PRINCIPLES 
2.1 GROUPING LAWS 

Gestalt theory starts with the assumption of active “grouping” laws in visual perception 
(see [14], [31]). These groups are identifiable with subsets of the retina. We shall talk in 
the following of “points” or groups of points which we identify with spatial parts of the 
planar rough percept. In image analysis, we shall identify them as well with the points of 
the digital image. Whenever points (or previously formed groups) have one or several 
characteristics in common, they get grouped and form a new, larger visual object, a gestalt. 
The list of elementary grouping laws given by Gaetano Kanizsa in Grammatica del Vedere,
page 45 and following [14], is vicinanza, somiglianza, continuità di direzione,
completamento amodale, chiusura, larghezza costante, tendenza alla convessità, simmetria,
movimento solidale, esperienza passata, that is: vicinity, similarity, continuity of direction,
amodal completion, closure, constant width, tendency to convexity, symmetry, common
motion, past experience. This list is actually very close to the list of grouping laws
considered in the founding paper of Wertheimer [31]. These laws are supposed to be at work for
every new percept. The amodal completion, one of the main subjects of Kanizsa's books, 
is, from the geometric viewpoint, a variant of the good continuation law 2 . Figure 1 illus- 
trates many of the grouping laws stated above. The subjects asked to describe briefly such 
a figure give an account of it as “three letters X” built in different ways. 




Figure 1. The building up of gestalt: X-shapes. Each one is built up with branches which are themselves groups 
of similar objects; the objects, rectangles or circles are complex gestalts, since they combine color constancy, 
constant width, convexity, parallelism, past experience etc. 



Most grouping laws stated above work from local to global. They are of mathematical 
nature, but must actually be split into more specific grouping laws to receive a mathe- 
matical and computational treatment: 

• Vicinity, for instance, can mean connectedness (i.e. spots glued together) or clusters
(spots or objects which are close enough to each other and far enough from the rest
to build a group). This vicinity gestalt is at work in all sub-figures of figure 2.

• Similarity can mean similarity of color, shape, texture, orientation, etc. Each one of
these gestalt laws is very important by itself (see again figure 2).

• Continuity of direction can be applied to an array of objects (figure 2 again). Let us add
to it alignment as a grouping law by itself (constancy of direction instead of continuity of
direction).

• Constant width is also illustrated in the same figure 2 and is very relevant for
drawings and all kinds of natural and artificial forms.

• Notice, in the same spirit, that convexity, also illustrated, is a particularization of both
closure and continuity of direction.

• Past experience: in the list of partial gestalts which are looked for in any image, we
can have generic shapes like circles, ellipses, rectangles, and also silhouettes of familiar
objects like faces, cats, chairs, etc.




Figure 2. Illustration of gestalt principles. From left to right and top to bottom: color constancy + proximity, 
similarity of shape and similarity of texture; good continuation; closure (of a curve); convexity; parallelism; 
amodal completion (a disk seen behind the square); color constancy; good continuation again (dots building a
curve); closure (of a curve made of dots); modal completion: we tend to see a square in the last figure and its 
sides are seen in a modal way (subjective contour). Notice also the texture similarity of the first and last fig- 
ures. Most of the figures involve constant width. In this complex figure, the sub-figures are identified by their 
alignment in two rows and their size similarity. 



All of the above grouping laws belong, according to Kanizsa, to the so-called processo
primario (primary process), opposed to a more cognitive secondary process. Also, it may
of course be asked why and how this list of geometric qualities has emerged in the course
of biological evolution. Brunswik and Kamiya [4] were among the first to suggest that the
gestalt grouping laws were directly related to the geometric statistics of the natural world.
Since then, several works have addressed from different points of view these statistics and
the building elements which should be conceptually considered in perception theory,
and/or numerically used in Computer Vision [3], [25], [10].

The grouping laws usually collaborate in the building up of larger and larger objects.
A simple object like a square, whose boundary has been drawn in black with a pencil on 
a white sheet, will be perceived by connectedness (the boundary is a black line), by con- 
stant width (of the stroke), convexity and closure (of the black pencil stroke), parallelism 
(between opposite sides), orthogonality (between adjacent sides), again constant width 
(of both pairs of opposite sides). 

We must therefore distinguish between global gestalt and the partial gestalts. The 
square is a global gestalt, but it is the synthesis of a long list of concurring local group- 
ings, leading to parts of the square endowed with some gestalt quality. Such parts we 
shall call partial gestalts.

Notice also that all grouping gestalt laws are recursive: they can be applied first to atomic
inputs and then in the same way to partial gestalts already constituted. Let us illustrate
this by an example. In figure 3, the same partial gestalt laws, namely alignment, parallelism,
constant width and proximity, are recursively applied no fewer than six times: the
single elongated dots are first aligned in rows, these rows in groups of two parallel rows,
these groups again in groups of five parallel horizontal bars, these groups again in groups
of six parallel vertical bars. The final groups appear to be again made of two macroscopic
horizontal bars. The whole organization of such figures can be seen at once.



Figure 3. Recursiveness of gestalt laws: here, constant width and parallelism are applied at different levels in 
the building up of the final group not less than six times, from the smallest bricks which are actually complex 
gestalts, being roughly rectangles, up to the final rectangle. Many objects can present deeper and more com- 
plex constructions. 



2.2 GLOBAL GESTALT PRINCIPLES

While the partial, recursive, grouping gestalt laws leave little doubt about
their definition as a computational task from atomic data, the global gestalt principles are
far more challenging. For many of them, we do not even know whether they are properly
constitutive laws or rather an elegant way of summarizing various perception
processes. They constitute, however, the only cues we have about the way the partial
gestalt laws could be derived from a more general principle. On the other hand, these
principles are absolutely necessary in the description of the perception process, since
they should fix the way grouping laws interact or compete to create the final global
percepts, the final gestalts. Let us go on with the list of gestalt principles which can be
extracted from [14]. We have:

raggruppamento secondo la direzionalità della struttura (Kanizsa, Grammatica del
Vedere, op. cit., page 54): inheritance by the parts of the overall group direction. This is
a statement which might find its place in Plato's Parmenides: “the parts inherit the
whole's qualities”.

pregnancy, structural coherence, unity (pregnanza, coerenza strutturale, carattere uni- 
tario, ibidem, page 59), tendency to maximal regularity (ibidem, p. 60), articulation 
whole-parts, (in German, Gliederung), articulation without remainder (ibidem p. 65): 
These seven Gestalt laws are not partial gestalts; in order to deal with them from the 
computer vision viewpoint, one has to assume that all partial grouping laws have been 
applied and that a synthesis of the groups into the final global gestalts must be thereafter 
performed. Each principle describes some aspect of the synthesis made from partial 
grouping laws into the most wholesome, coherent, complete and well-articulated per- 
cept. 



3. CONFLICTS OF PARTIAL GESTALTS AND THE MASKING PHENOMENON 

With the computational discussion to come in mind, we wish to examine the relation- 
ship between two important technical terms of Gestalt theory, namely conflicts and 
masking. 



3.1 CONFLICTS 

The gestalt laws are stated as independent grouping laws. Now, they start from the 
same building elements. Thus, conflicts between grouping laws can occur and therefore 
conflicts between different interpretations, that is, different possible groups in a given 
figure. Three cases: 

a) two grouping laws act simultaneously on the same elements and give rise to two
overlapping groups. It is not difficult to build figures where this occurs, as in figure 4.
In that example, we can group the black dots and the white dots by similarity of color. 
All the same, we see a rectangular grid made of all the black dots and part of the white 
ones. We also see a good continuing curve, with a loop, made of white dots. These 
groups do not compete. 

b) two grouping laws compete and one of them wins, the other one being inhibited. 
This case is called masking and will be discussed thoroughly in the next section. 

c) conflict: in that case, both grouping laws are potentially active, but the groups can- 
not exist simultaneously. In addition, none of the grouping laws wins clearly. Thus, the 
figure is ambiguous and presents two or more possible interpretations.

Figure 4. Gestalt laws in simultaneous action without conflict: the white dots are elements of the grid
(alignment, constant width) and simultaneously participate in a good continuing curve.



A large part of Kanizsa's second chapter [14] is dedicated to conflicts of gestalts. Their
study leads to the invention of clever figures where an equilibrium is maintained between
two conflicting gestalt laws struggling to give the final figure organization. The viewer can
direct his attention in both ways, see both organizations and perceive their conflict. A
seminal experiment due to Wertheimer 3 gives an easy way to construct such conflicts. In
figure 5, we see on the left a figure made of rectangles and ellipses. The prominent grouping
laws are: a) shape similarity (L1), which leads us to group the ellipses together and the
rectangles together as two conspicuous groups; b) the vicinity law L2, which makes all of these
elements build anyway a unified cluster. Thus, on the left figure, both laws coexist without
real conflict. On the right figure instead, two clusters are present. Each one is made of
heterogeneous shapes, but they fall apart enough to enforce the splitting of the ellipses group
and of the rectangles group. Thus, on the right, the vicinity law L2 tends to win. Such
figures can be varied, e.g. by progressively changing the distance between clusters, until the
final figure presents a good equilibrium between conflicting laws.

Some laws, like good continuation, are so strong that they almost systematically win,
as is illustrated in figure 6. In this figure, two figures with a striking axial symmetry are
concatenated in such a way that their boundaries are put in “good continuation”. The
result is a different interpretation, where the symmetric figures literally disappear. This is
a conflict, but with a total winner. It therefore enters into the category of masking.

Figure 5. Conflict of similarity of shapes with vicinity. We can easily view the left hand figure as two groups by
shape similarity, one made of rectangles and the other one of ellipses. On the right, two different groups emerge
by vicinity. Vicinity “wins” against similarity of shapes.






Figure 6. A “conflict of gestalts”: two overlapping closed curves or, as suggested on the right, two symmetric
curves which touch at two points? We can interpret this experiment as a masking of the symmetry gestalt law 
by the good continuation law. (From Kanizsa, Grammatica del Vedere p 195, op. cit.) 



3.2 MASKING 

Masking is illustrated by a lot of puzzling figures, where partial gestalts are literally
hidden by other partial gestalts giving a better global explanation of the final figure. The
masking phenomenon is the outcome of a conflict between two partial gestalts L1 and L2
struggling to organize a figure. When one of them, L1, wins, a striking phenomenon
occurs: the other possible organization, which would result from L2, is hidden. Only an
explicit comment to the viewer can remind her of the existence of the possible
organization under L2: the parts of the figure which might be perceived by L2 have become
invisible, masked in the final figure, which is perceived under L1 only.

Kanizsa considers four kinds of masking: masking by embedment in a texture; masking
by addition (the Gottschaldt technique); masking by subtraction (the Street technique);
masking by manipulation of the figure-background articulation (Rubin, many famous
examples by Escher). The first technique we shall consider is masking in texture.
Its principle is: a geometrically organized figure is embedded into a texture, that is, a
whole region made of similar building elements. This masking may well be called
embeddedness as suggested by Kanizsa 4 . Figure 7 gives a good instance of the power of this
masking, which has been thoroughly studied by the schools of Beck and Julesz [13]. In
this clever figure, the base of a triangle is literally hidden in a set of parallel lines. We
can interpret the texture masking as a conflict between an arbitrary organizing law L2 and
the similarity law, L1. The masking technique works by multiple additions embedding a
figure F organized under some law L2 into many elements which have a shape similar to
the building blocks of F.

Figure 7. Masking by embedding in a texture. The base of the triangle becomes invisible as it is embedded in
a group of parallel lines. (Galli and Zama, quoted in Vedere e pensare, op. cit.).



The same process is at work in figure 8. In that figure, one sees that a curve made
of roughly aligned pencil strokes is embedded in a set of many more parallel strokes. 



Figure 8. Masking by embedding in a texture again. On the right, a curve created from strokes by “good
continuation”. This curve is present, but masked on the left. We can consider it as a conflict between L2, “good
continuation”, and L1, similarity of direction. The similarity of direction is more powerful, as it organizes the
full figure (articulazione senza resti).



In the masking by addition technique, due to Gottschaldt, a figure is concealed by
addition of new elements which create another and more powerful organization. Here, L1
and L2 can be any organizing law. In figure 9, a hexagon is thoroughly concealed by the
addition to the figure of two parallelograms which include in their sides the initial sides
of the hexagon. Noticeably, the “winning laws” are the same which made the hexagon
so conspicuous before masking, namely closure, symmetry, convexity and good
continuation.

As figure 10 shows, L1 and L2 can exchange their roles. On the left, the curve obtained by
good continuation is made of perfect half circles concatenated. This circular shape is
masked in the good continuation. Surprisingly enough, the curve on the left is present in
the figure on the right, but masked by the circles. Thus, on the left, good continuation
wins against the past experience of circles. On the right, the converse occurs; convexity,
closure and circularity win against good continuation and mask it.

The third masking technique considered by Kanizsa is subtraction (Street technique), that
is, removal of parts of the figure. As is apparent in figure 11, where a square is amputated in
three different ways, the technique is effective only when the removal creates a new gestalt.
The square remains in view in the third figure from the left, where the removal has been
made at random and can be assimilated to a random perturbation. In the second and fourth
figures, instead, the square disappears, although some parts of its sides have been preserved.

Figure 9. Masking by concealment (Gottschaldt 1926). The hexagon on the left is concealed in the figure on
the right, and more concealed below. The hexagon was built by the closure, symmetry, convexity gestalt laws.
The same laws are active to form the winner figures, the parallelograms.

Figure 10. Masking of circles in good continuation, or, conversely, masking of good continuation by closure
and convexity. We do not really see arcs of circles on the left, although significant and accurate parts of circles
are present: we see a smooth curve. Conversely, we do not see the left “good” curve as a part of the right
figure. It is nonetheless present in it.








Figure 11. Masking by the Street subtraction technique (1931), inspired from Kanizsa (Vedere e pensare p 176,
op. cit.). Parts are removed from the black square. When this is done in a coherent way, a new shape
appears (a rough cross in the second subfigure, four black spots in the last one) and the square is masked. It is
not masked at all in the third, though, where the removal has been done in a random way and does not yield
a competing interpretation.

We should not end this section without considering briefly the last category of masking
mentioned by Kanizsa, the masking by inversion of the figure-background relationship. This
kind of masking is well known thanks to the famous Escher drawings. Its principle is “the
background is not a shape” (il fondo non è forma). Whenever strong gestalts are present in
an image, the space between those conspicuous shapes is not considered as a shape, even
when it has itself a familiar shape like a bird, a fish, a human profile. Again here, we can
interpret masking as the result of a conflict of two partial gestalt laws, one building the form
and the other one, the loser, not allowed to build the background as a gestalt.



4. QUANTITATIVE ASPECTS OF GESTALT THEORY 

In this section, we open the discussion on quantitative laws for computing partial
gestalts. We shall first consider some numerical aspects of Kanizsa's masking by texture.
We shall then make some comments on Kanizsa's paradox and its answer,
pointing out the involvement of a quantitative image resolution. These comments lead to
Shannon's sampling theory.



4.1 QUANTITATIVE ASPECTS OF THE MASKING PHENOMENON 

In the fifth chapter of Vedere e pensare, Kanizsa points out that “it is licit to sustain that
a black homogeneous region contains all theoretically possible plane figures, in the same
way as, for Michelangelo, a marble block virtually contains all possible statues”. Thus,
these virtual statues could be considered as masked. This is the so-called Kanizsa
paradox. Figure 12 shows that one can obtain any simple enough shape by pruning a regular
grid of black dots.

Figure 12. According to Kanizsa's paradox, the figure on the right is potentially present in the figure on the
left, and would indeed appear if we colored the corresponding dots. This illustrates the fact that the figure on
the left contains a huge number of possible different shapes.

In order to go further, it seems advisable to the mathematician to make a count: how many squares
could we see, for example, in this figure? Characterizing the square by its upper left
corner and its side length, it is easily computed that the number of squares whose corners lie
on the grid exceeds 400. The number of curves with “good continuation” made of about
20 points like the one drawn on the right of figure 12 is equally huge. We can estimate it
in the following way: we have 80 choices for the first point, and about five points among
the neighbors for the second point, etc. Thus, the number of possible good curves in our
figure is roughly 80 × 5^20 if we accept the curve to turn strongly, and about 80 × 3^20 if we
ask the curve to turn at a slow rate. In both cases, the number of possible “good” curves
in the grid is very large.
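
The orders of magnitude of this estimate can be checked directly. The short Python sketch below simply evaluates the two products, taking the figures of the text as given (80 starting points, about 20 steps, 5 or 3 admissible neighbors per step); the function name is ours and purely illustrative.

```python
def count_good_curves(n_start, n_neighbors, n_steps):
    """Rough count of dotted curves: n_start choices for the first point,
    then n_neighbors choices at each of the n_steps following steps."""
    return n_start * n_neighbors ** n_steps

# Figures quoted in the text: 80 grid points, curves of about 20 points.
strongly_turning = count_good_curves(80, 5, 20)
slowly_turning = count_good_curves(80, 3, 20)

print(f"{strongly_turning:.1e}")  # 7.6e+15
print(f"{slowly_turning:.1e}")    # 2.8e+11
```

Even the more restrictive count is several hundred billion candidate curves for a grid of only 80 dots.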

This multiplicity argument suggests that a grouping law can be active in an image
only if its application would not create a huge number of partial gestalts. Or, to say it in
another way, we can maintain that the multiplicity implies a masking by texture. Masking of
all possible good curves in the grid of figure 12 occurs just because too many such
curves are possible.

In figure 8 above (subsection 3.2), we can repeat the preceding quantitative argument.
In this figure, the left hand set of strokes actually contains, as an invisible part, the array of
strokes on the right. This array of strokes is obviously organized as a curve (good
continuation gestalt). This curve becomes invisible on the left hand figure, just because it gets
embedded in a more powerful gestalt, namely parallelism (similarity of direction). As we shall
see in the computational discussion, the fact that the curve has been masked is related to
another fact which is easy to check on the left hand part of the figure: one could select on
the left many curves of the same kind as the one given on the right.

In short, we do not consider Kanizsa's paradox as a difficulty to solve, but rather as an
arrow pointing towards the computational formulation of gestalt: in section 5, we shall
define a partial gestalt as a structure which is not masked in texture.

Figure 13. Below, an array of roughly aligned segments. Above, the same figure is embedded into a texture in
such a way that it still is visible as an alignment. We are in the limit situation associated with Vicario's
proposition: “is masked only what can be unmasked”.

We shall therefore not rule out the extreme masking cases, in contradiction with
Vicario's principle “è mascherato solo ciò che può essere smascherato” (is masked only
what can be unmasked). Clearly, all psychophysical masking experiments must be close
enough to the “conflict of gestalts” situation, where the masked gestalt still is attainable
when the subject's attention is directed. Thus, psychological masking experiments must
remain close to the non masking situation and therefore satisfy Vicario's principle. From
the computational viewpoint instead, figures 12 and 8 are nothing but very good
masking examples.

In this masking issue, one feels the necessity to pass from qualitative to quantitative
arguments: a gestalt can be more or less masked. How to compute the right information to
quantify this “more or less”? It is actually related to a precision parameter. In figure 13,
we constructed a texture by addition from the alignment drawn below. Clearly, some
masking is at work and we would not notice immediately the alignment in the texture if
our attention was not directed. All the same, the alignment remains somewhat
conspicuous and a quick scan may convince us that there is no other alignment of such an
accuracy in the texture. Thus, in that case, alignment is not masked by parallelism. Now,
one can suspect that this situation can be explained in quantitative terms as well: the
precision of the alignment matters here and should be evaluated. Precision will be one of the
three parameters we shall use in computational gestalt.



4.2 SHANNON THEORY AND THE DISCRETE NATURE OF IMAGES 

The preceding subsection introduced two of the parameters we shall have to deal with
in the computations, namely the number of possible partial gestalts and a precision
parameter. Before proceeding to computations, we must discuss the rough datum itself,
namely the computational nature of images, be they digital or biological. Kanizsa
briefly addresses this problem in the fifth chapter of Vedere e pensare, in his discussion of
the masking phenomenon: “We should not consider as masked elements which are too
small to attain the visibility threshold”. Kanizsa was aware that the amount of visible
points in a figure is finite 5 . He explains in the same chapter why this leads to work with
figures made of dots; we can consider this decision as a way to quantize the geometric
information.

In order to define mathematically an image, be it digital or biological, in the simplest
possible way, we just need to fix a point of focus. Assume all photons converging
towards this focus are intercepted by a surface which has been divided into regular cells,
usually squares or hexagons. Each cell counts its number of photon hits during a fixed
exposure time. This count gives a grey level image, that is, a rectangular (roughly
circular in biological vision) array of grey level values on a grid. In the case of digital
images, C.C.D. matrices give regular grids made of squares. In the biological case, the
retina is divided into hexagonal cells with growing sizes from the fovea. Thus, in all
cases, a digital or biological image contains a finite number of values on a grid.
Shannon [28] made explicit the mathematical conditions under which, from this matrix of
values, a continuous image can be reconstructed. By Shannon's theory, we can compute
the grey level at all points, and not only the points of the grid. Of course, when we zoom
in on the interpolated image it looks more and more blurry: the amount of information in
a digital image is indeed finite and the resolution of the image is bounded. The points
of the grid together with their grey level values are called pixels, an abbreviation for
picture elements.
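
As an illustration of this reconstruction, here is a minimal one-dimensional sketch in Python of Shannon (cardinal sine) interpolation. It assumes unit spacing between samples and treats the signal as zero outside the sampled range, which is a simplification; the names `sinc` and `shannon_interpolate`, and the toy grey levels, are illustrative and do not come from the paper.

```python
import math

def sinc(t):
    """Normalized cardinal sine: sin(pi t) / (pi t), with sinc(0) = 1."""
    if t == 0.0:
        return 1.0
    return math.sin(math.pi * t) / (math.pi * t)

def shannon_interpolate(samples, x):
    """Value at real position x of the band-limited signal whose values at
    the integers 0, 1, ..., len(samples) - 1 are `samples`
    (samples outside this range are assumed to be zero)."""
    return sum(s * sinc(x - k) for k, s in enumerate(samples))

grey_levels = [10.0, 12.0, 40.0, 200.0, 210.0, 205.0]  # toy row of pixels

# On the grid, the interpolation returns the samples themselves
# (up to floating-point rounding).
on_grid = [shannon_interpolate(grey_levels, k) for k in range(len(grey_levels))]

# Between grid points, the value depends on all the samples.
between = shannon_interpolate(grey_levels, 2.5)
```

The two-dimensional case applies the same formula separately along rows and columns.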

The pixels are the computational atoms from which gestalt grouping procedures can
start. Now, if the image is finite, and therefore blurry, how can we infer sure events such as
lines, circles, squares or any other gestalts from discrete data? If the image is
blurry, none of these structures can be inferred as completely sure; their exact location
must remain uncertain. This is crucial: all basic geometric information in the image has
a precision 6 . Figure 13 shows it plainly. It is easy to imagine that if the aligned segments,
still visible in the figure, are slightly less aligned, then the alignment will tend to
disappear. This is easily checked with figure 14, where we moved the aligned segments
slightly up and down.

Let us now say briefly which local, atomic information can be the starting point of
computations. Since all local information about a function u at a point (x, y) boils down to
its Taylor expansion, we can assume that the atomic pieces of information are:

• the value u(x, y) of the grey level at each point (x, y) of the image plane. Since the
function u is blurry, this value is valid at points close to (x, y).

Figure 14. When the alignment present in figure 13 is made less accurate, the masking by texture becomes more
efficient. The precision plays a crucial role in the computational gestalt theory outlined in the next section.



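• the gradient of u at (x, y), the vector

$$Du(x, y) = \left( \frac{\partial u}{\partial x}, \frac{\partial u}{\partial y} \right)(x, y);$$

• the orientation at (x, y), the vector

$$\mathrm{Orient}(x, y) = \frac{1}{\| Du(x, y) \|} \left( -\frac{\partial u}{\partial y}, \frac{\partial u}{\partial x} \right)(x, y).$$

This vector is visually intuitive, since it is tangent to the boundaries one can see in an
image.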

These local measurements are known at each point of the grid and can be computed at
any point of the image by Shannon interpolation. They are quantized, having a finite
number of digits, and therefore noisy. Thus, each one of the preceding measurements has
an intrinsic precision. The orientation is invariant when the image contrast changes
(which means robustness to illumination conditions). Attneave and Julesz [13] refer to it
for shape recognition and texture discrimination theory. Grey level, gradient and
orientation are the only local information we shall retain for the numerical experiments of the
next section, together with their precisions.
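
The contrast invariance of the orientation can be illustrated with a minimal Python sketch. It assumes centred finite differences on the pixel grid instead of Shannon interpolation, and it takes the orientation to be the gradient rotated by a quarter turn and normalized, so that it is tangent to the level lines; both choices, as well as the helper names and the toy image, are assumptions made for the illustration.

```python
import math

def gradient(u, x, y):
    """Centred finite-difference gradient of image u (a list of rows) at the
    interior pixel (x, y); u[y][x] is the grey level."""
    ux = (u[y][x + 1] - u[y][x - 1]) / 2.0
    uy = (u[y + 1][x] - u[y - 1][x]) / 2.0
    return ux, uy

def orientation(u, x, y):
    """Unit vector orthogonal to the gradient, or None where the gradient
    vanishes and no orientation is defined."""
    ux, uy = gradient(u, x, y)
    norm = math.hypot(ux, uy)
    if norm == 0.0:
        return None
    return (-uy / norm, ux / norm)

# Toy image: a vertical edge, dark on the left and bright on the right.
image = [[0, 0, 100, 100] for _ in range(4)]
# Same scene after an increasing affine contrast change s -> 2 s + 30.
brighter = [[2 * v + 30 for v in row] for row in image]

grad_a, grad_b = gradient(image, 1, 1), gradient(brighter, 1, 1)
orient_a, orient_b = orientation(image, 1, 1), orientation(brighter, 1, 1)
```

Here the gradient doubles under the contrast change while the orientation stays the same vertical unit vector, tangent to the edge.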



5. COMPUTING PARTIAL GESTALTS IN DIGITAL IMAGES 

In this section, we shall summarize a computational theory which makes it possible to find
partial gestalts automatically in digital images. This theory essentially predicts
perception thresholds which can be computed on every image and give a usually clear-cut
decision between what is seeable as a geometric structure (gestalt) in the image and what
is not. Those thresholds are computable thanks to the discrete nature of images. Many
more details can be found in [9]. All computations below will involve three
fundamental numbers, whose implicit presence in Gestalt theory has just been pointed out,
namely

- a relative precision parameter p which we will treat as a probability;

- a number N_conf of possible configurations for the looked-for partial gestalt. This
number is finite because the image resolution is bounded;

- the number N of pixels of the image.

Of course, p and N_conf will depend upon the kind of gestalt grouping law under
consideration. We can relate p and N to two fundamental qualities of any image: its noise and
its blur.







AGNES DESOLNEUX, LIONEL MOISAN AND JEAN-MICHEL MOREL 



5.1 A GENERAL DETECTION PRINCIPLE 

In [6], [7], [8], we outlined a computational method to decide whether a given partial
gestalt (computed by any segmentation or grouping method) is reliable or not. As we
shall recall, our method gives absolute thresholds, depending only on the image size,
which make it possible to decide whether a given gestalt is perceptually relevant or not.

We applied a general perception principle which we called Helmholtz principle (figure 
15). This principle yields computational grouping thresholds associated with each gestalt 
quality. It can be stated in the following generic way. Assume that atomic objects 0\,0i, ..., 
On are present in an image. Assume that k of them, say 0\, ..., Ok have a common fea- 
ture, say, same color, same orientation, position etc.. We are then facing the dilemma: is 
this common feature happening by chance or is it significant and enough to group O i,... 
Okl In order to answer this question we make the following mental experiment: we as- 
sume a priori that the considered quality has been randomly and uniformly distributed 
on all objects O i,... On Then we (mentally) assume that the observed position of objects 
in the image is a random realization of this uniform process. We finally ask the question: 
is the observed repartition probable or not ? If not, this proves a contrario that a group- 
ing process (a gestalt) is at stake. Helmholtz principle states roughly that in such mental 
experiments, the numerical qualities of the objects are assumed to be equally distributed 
and independent. Mathematically, this can be formalized by 

Definition 1 (ε-meaningful event [6]) We say that an event of type “such configuration
of geometric objects has such property” is ε-meaningful if the expectation of the
number of occurrences of this event is less than ε under the uniform random assumption.

As an example of generic computation we can do with this definition, let us assume
that the probability that a given object O_i has the considered quality is equal to p. Then,
under the independence assumption, the probability that at least k objects out of the
observed n have this quality is

    B(p, n, k) = \sum_{i=k}^{n} \binom{n}{i} p^i (1 - p)^{n-i},

i.e. the tail of the binomial distribution. In order to get an upper bound for the number of
false alarms, i.e. the expectation of the number of geometric events happening by pure
chance, we can simply multiply the above probability by the number of tests we perform
on the image. This number of tests N_conf corresponds to the number of different possible
positions we could have for the searched gestalt. Then, in most cases we shall consider in the
next subsections, a considered event will be defined as ε-meaningful if

    NFA = N_{conf} \cdot B(p, n, k) \le \varepsilon.

In the following, we call the left-hand side of this inequality the “number of false
alarms” (NFA). The number of false alarms of an event measures the “meaningfulness”
of this event: the smaller it is, the more meaningful the event is. We refer to [6] for a
complete discussion of this definition. To the best of our knowledge, the use of the
binomial tail for alignment detection was introduced by Stewart [30].
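
The computation described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names and the default threshold eps = 1 are hypothetical choices made here, and the binomial tail is summed directly, which is adequate only for small n.

```python
from math import comb


def binomial_tail(p: float, n: int, k: int) -> float:
    """B(p, n, k): probability that at least k of n independent objects
    have a quality that each of them has with probability p."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def number_of_false_alarms(n_conf: float, p: float, n: int, k: int) -> float:
    """NFA = (number of tests) x (binomial tail)."""
    return n_conf * binomial_tail(p, n, k)


def is_meaningful(n_conf: float, p: float, n: int, k: int, eps: float = 1.0) -> bool:
    """An event is eps-meaningful when its NFA does not exceed eps."""
    return number_of_false_alarms(n_conf, p, n, k) <= eps
```

With a fixed number of tests, raising k (more objects sharing the quality) lowers the NFA, so the event becomes more meaningful; raising the number of tests has the opposite effect.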



5.2 COMPUTATION OF SIX PARTIAL GESTALTS 



Alignments of points. 

Points will be called aligned if they all fall into a thin enough strip, and in sufficient
number. This qualitative definition is easily made quantitative. The precision of the alignment
is measured by the width of the strip. Let S be a strip of width a. Let p(S) denote the prior
probability for a point to fall in S, and let k(S) denote the number of points (among the M
observed points) which are in S. The following definition makes it possible to compute all
strips where a meaningful alignment is observed (see figures 15 and 18).




Figure 15. An illustration of the Helmholtz principle: non-casual alignments are automatically
detected by the Helmholtz principle as a large deviation from randomness. Left: 20 uniformly
randomly distributed dots, and 7 aligned dots added. Middle: this meaningful (and seeable)
alignment is detected as a large deviation from randomness. Right: the same alignment added to
80 random dots. The alignment is no longer meaningful (and no longer seeable). In order to be
meaningful, it would need to contain at least 11 points.

Definition 2 ([9]) A strip S is ε-meaningful if

    NF(S) = N_s \cdot B(p(S), M, k(S)) < \varepsilon,

where N_s is the number of considered strips (one has N_s ≈ 2π(R/a)^2, where R is the
half-diameter of Ω and a the minimal width of a strip).
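
As a hedged illustration of Definition 2, the sketch below computes the number of false alarms of a strip and the smallest number of points a strip must contain to be meaningful. The function names are hypothetical, the strip count uses the approximation 2π(R/a)^2 quoted in the definition, and the numerical values in the usage note are arbitrary; the sketch is not meant to reproduce the thresholds of figure 15, whose parameters are not given here.

```python
from math import comb, pi


def binomial_tail(p: float, n: int, k: int) -> float:
    """Tail of the binomial law: P(at least k successes among n trials)."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def number_of_strips(half_diameter: float, min_width: float) -> float:
    """Approximate number of tested strips, N_s ~ 2*pi*(R/a)^2."""
    return 2.0 * pi * (half_diameter / min_width) ** 2


def strip_nfa(p_strip, n_points, k_in_strip, half_diameter, min_width) -> float:
    """NF(S) = N_s * B(p(S), M, k(S))."""
    n_s = number_of_strips(half_diameter, min_width)
    return n_s * binomial_tail(p_strip, n_points, k_in_strip)


def min_points_in_strip(p_strip, n_points, half_diameter, min_width, eps=1.0):
    """Smallest k such that a strip holding k of the M points has NF(S) < eps,
    or None when even k = M is not enough."""
    for k in range(n_points + 1):
        if strip_nfa(p_strip, n_points, k, half_diameter, min_width) < eps:
            return k
    return None
```

For instance, `min_points_in_strip(0.02, 27, 10.0, 1.0)` and `min_points_in_strip(0.02, 87, 10.0, 1.0)` compare the same strip against a sparse and a denser random background: adding background points can only raise the required number of aligned points, which is the qualitative effect shown in figure 15.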

Let us immediately give a summary of several algorithms based on the same principles,
most of which use, as above, the tail of the binomial law. This is done in table 1, where we
summarize the (very similar) formulae used to compute the following partial gestalts:
alignments (of orientations in a digital image), contrasted boundaries, all kinds of
similarities for some quality measured by a real number (grey level, orientation, ...), and of
course the most basic one, treated in the last row, namely the vicinity gestalt.

Alignments in a digital image 

The first row of table 1 treats the alignments in a digital image. As we explained in
subsection 4.2, an orientation can be computed at each point of the image. Whenever a long
enough segment occurs in the image, whose orientations are aligned with the segment, this
segment is perceived as an alignment. We consider the following event: “on a discrete
segment of the image, joining two pixel centers, and with length l, at least k independent points
have the same direction as the segment with precision p.” The definition of the number of
false alarms is given in the first row of the table, and an example of the alignments in a
digital aerial image whose number of false alarms is less than 1 is given in figure 16.
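
To make the event above concrete, the following sketch computes the smallest number k of aligned points that a segment of l points needs in order to have fewer than one false alarm. It assumes, as a reading of the first row of table 1, that the number of tested segments is N^2 (one per ordered pair of pixel centers) and that p = 1/16; the function names and the 512 x 512 image size in the usage note are hypothetical.

```python
from math import comb


def binomial_tail(p: float, n: int, k: int) -> float:
    """Tail of the binomial law: P(at least k successes among n trials)."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def min_aligned_points(length: int, n_pixels: int, p: float = 1 / 16, eps: float = 1.0):
    """Smallest k with N_segments * B(p, length, k) < eps, or None if the
    segment is too short to ever be meaningful.
    Assumption: N_segments = n_pixels ** 2."""
    n_segments = n_pixels**2
    for k in range(length + 1):
        if n_segments * binomial_tail(p, length, k) < eps:
            return k
    return None
```

Under these assumptions, in a 512 x 512 image a segment of 5 points can never be meaningful (`min_aligned_points(5, 512 * 512)` returns `None`), while a segment of 10 points is meaningful only if all 10 points are aligned with it.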

Maximal meaningful gestalts and articulazione senza resti 
On this example of alignments, we can address a problem encountered by the mathematical
formalization of gestalt.

Alignment of directions on a segment [6]
    GROUP LOOKED FOR: a discrete segment with points at Nyquist distance (i.e. 2)
    MEASUREMENTS: k: number of aligned points; l: number of points on the segment
    NUMBER OF FALSE ALARMS: N_segments · B(p, l, k), where N_segments = N^2, N is the
    number of pixels in the image, and p = 1/16 (angular precision)

Contrasted edges and boundaries [7]
    GROUP LOOKED FOR: a level line (or a piece of one) with points at Nyquist distance (i.e. 2)
    MEASUREMENTS: μ: minimum contrast (gradient norm) along the curve; l: length of the curve
    NUMBER OF FALSE ALARMS: N_level_lines · H(μ)^l, where H is the empirical cumulative
    distribution of the gradient norm on the image

Similarity of a uniform scalar quality (grey level, orientation, etc.) [9]
    GROUP LOOKED FOR: a group of objects having a scalar quality q such that a ≤ q ≤ b
    MEASUREMENTS: k: number of points in the group; M: total number of objects
    NUMBER OF FALSE ALARMS: [...], where L is the number of values (q ∈ {1..L})

Similarity of a scalar quality with decreasing distribution (area, length, etc.) [9]
    GROUP LOOKED FOR: a group of objects having a scalar quality q such that a ≤ q ≤ b
    MEASUREMENTS: k: number of points in the group; M: total number of objects
    NUMBER OF FALSE ALARMS: [...], where L is the number of values (q ∈ {1..L}) and D is
    the set of decreasing distributions on {1..L}

Alignment of points (or objects) [9]
    GROUP LOOKED FOR: a group of points falling in a strip (region enclosed by two parallel lines)
    MEASUREMENTS: p: relative area of the strip; k: number of points falling in the strip
    NUMBER OF FALSE ALARMS: N_strips · B(p, M, k), where M is the total number of points.
    The strips are quantized in position, width and orientation

Vicinity: clusters of points (or objects) [9]
    GROUP LOOKED FOR: a group of points falling in a region enclosed by a low-resolution curve
    MEASUREMENTS: σ: relative area of the region; σ': relative area of the thick
    low-resolution curve; k: number of points falling in the region
    NUMBER OF FALSE ALARMS: [...], where M is the total number of points and
    N_regions = N^2 q r 2^L: the low-resolution curves are quantized in resolution (q),
    thickness (r), location (N^2), and bounded in length (L)

Table 1. List of gestalts computed so far.



Assume that on a straight line we have found a very meaningful segment S. Then, by
slightly enlarging or slightly reducing S, we still find a meaningful segment. This means
that meaningfulness cannot be a univocal criterion for detection, unless we can point out
the “best meaningful” explanation of what is observed as meaningful. This is done by
the following definition, which can be adapted as well to meaningful boundaries [7],
meaningful edges [7], meaningful modes in a histogram and clusters [9].

Definition 3 ([8]) We say that an ε-meaningful geometric structure A is maximal meaningful if

• it does not contain a strictly more meaningful structure: ∀B ⊂ A, NF(B) ≥ NF(A);

• it is not contained in a more meaningful structure: ∀B ⊃ A, B ≠ A, NF(B) > NF(A).

It is proved in [8] that maximal structures cannot overlap, which is one of the main the- 
oretical outcomes validating the above definitions. This definition formalizes the artic- 
ulazione senza resti principle in the case of a single grouping law. 
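
The maximality rule can be illustrated on a single line of the image. The sketch below is a brute-force version under stated assumptions: the points of the line are given as a list of 0/1 flags (1 meaning "aligned with the line"), every interval of consecutive points is a candidate segment, and a candidate is kept when no interval inside it has a strictly smaller number of false alarms and no interval strictly containing it has a smaller or equal one. The function names and the parameter n_tests are hypothetical, and the exhaustive search is only practical for short lines.

```python
from math import comb


def binomial_tail(p: float, n: int, k: int) -> float:
    """Tail of the binomial law: P(at least k successes among n trials)."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def maximal_meaningful_intervals(flags, p, n_tests, eps=1.0):
    """Return (start, end, nfa) for every maximal meaningful interval,
    with inclusive 0-based bounds."""
    n = len(flags)
    prefix = [0]
    for flag in flags:
        prefix.append(prefix[-1] + (1 if flag else 0))

    # Number of false alarms of every interval [i, j].
    nfa = {}
    for i in range(n):
        for j in range(i, n):
            k = prefix[j + 1] - prefix[i]
            nfa[(i, j)] = n_tests * binomial_tail(p, j - i + 1, k)

    result = []
    for (i, j), value in nfa.items():
        if value > eps:
            continue  # not eps-meaningful
        # Rejected if some sub-interval is strictly more meaningful.
        contains_better = any(
            nfa[(a, b)] < value for a in range(i, j + 1) for b in range(a, j + 1)
        )
        # Rejected if some strictly larger interval is at least as meaningful.
        inside_as_good = any(
            nfa[(a, b)] <= value
            for a in range(0, i + 1)
            for b in range(j, n)
            if (a, b) != (i, j)
        )
        if not contains_better and not inside_as_good:
            result.append((i, j, value))
    return result
```

On a line whose only aligned points form one uninterrupted run, the run itself is the single maximal meaningful interval: slightly shorter or slightly longer intervals are still meaningful, but each of them is discarded by one of the two conditions.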




Figure 17. Collaboration of gestalts. The objects tend to be grouped similarly by several different
partial gestalts. First row: original DNA image (left) and its maximal meaningful boundaries (right).
Second row, left: histogram of areas of the meaningful blobs. There is a unique maximal mode
(256-416). The outliers are the double blob, the white background region and the three tiny blobs.
Second row, middle: histogram of orientations of the meaningful blobs (computed as the principal
axis of each blob). There is a single maximal meaningful mode (interval). This mode is the interval
85-95. It contains 28 objects out of 32. The outliers are the white background region and three
tiny spots. Second row, right: histogram of the mean grey levels inside each blob. There is a
single maximal mode containing 30 objects out of 32, in the grey level interval 74-130. The
outliers are the white background region and the darkest spot.



Boundaries 

One can define in a very similar way the “boundary” grouping law. This grouping law
is never stated explicitly in gestaltism, probably because it is too obvious for
phenomenologists. From the computational viewpoint, it is not obvious at all.

The definition of the number of false alarms for boundaries again involves two variables:
the length l of the level line, and its minimal contrast μ, which is interpreted as a
precision. An example of boundary detection is given in figure 16.

Similarity 

The third row of table 1 addresses the main gestaltic grouping principle: points or objects
having a feature in common are grouped, just because they have this feature in common.
Assume k objects O_1, ..., O_k, among a longer list O_1, ..., O_n, have some quality q in
common. Assume that this quality is actually measured as a real number. Then our decision
of whether the grouping of O_1, ..., O_k is relevant must be based on the fact that the values
q(O_1), ..., q(O_k) make a meaningfully dense interval of the histogram of q(O_1), ..., q(O_n).
Thus, the automatic quality grouping is reduced to the question of an automatic,
parameterless histogram mode detector. Of course, this mode detector depends upon the kind
of feature under consideration.

We shall consider two paradigmatic cases, namely the case of orientations, where the
histogram can be assumed by the Helmholtz principle to be flat, and the case of object
sizes (areas), where the null assumption is that the size histogram is decreasing (see
figure 17).

Thus, the third and fourth rows of our table make it possible to detect all kinds of
similarity gestalt: objects grouped by orientation, or grey level, or any perceptually
relevant scalar quality.
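
For the flat-histogram case, a parameterless mode detector can be sketched as follows. This is an illustration under stated assumptions rather than the formula of table 1: the quality is quantized into L values, the prior probability of an interval [a, b] is taken to be (b - a + 1)/L, and the number of tests is taken to be the number of intervals, L(L + 1)/2. The function names are hypothetical, and bins are indexed from 0 in the code.

```python
from math import comb


def binomial_tail(p: float, n: int, k: int) -> float:
    """Tail of the binomial law: P(at least k successes among n trials)."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def meaningful_intervals_flat(hist, eps=1.0):
    """Return (a, b, nfa) for every interval of bins [a, b] whose NFA is below
    eps under a flat prior, sorted from most to least meaningful.
    hist[i] is the number of objects whose quantized quality falls in bin i."""
    n_values = len(hist)
    n_objects = sum(hist)
    n_tests = n_values * (n_values + 1) // 2  # assumption: one test per interval

    prefix = [0]
    for count in hist:
        prefix.append(prefix[-1] + count)

    found = []
    for a in range(n_values):
        for b in range(a, n_values):
            k = prefix[b + 1] - prefix[a]   # objects observed in [a, b]
            p = (b - a + 1) / n_values      # flat prior mass of [a, b]
            nfa = n_tests * binomial_tail(p, n_objects, k)
            if nfa < eps:
                found.append((a, b, nfa))
    return sorted(found, key=lambda item: item[2])
```

A histogram with one strongly dominant bin yields that bin as the most meaningful interval, whereas a perfectly flat histogram yields no meaningful interval at all. Selecting a single mode among overlapping meaningful intervals would additionally require the maximality rule of Definition 3.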



5.3 STUDY OF AN EXAMPLE 

We are going to perform a complete study of a digital image, figure 17, involving all
computational gestalts defined in table 1. The analyzed image is a common digital image.
It is a scan of a photograph and has blur and noise. The seeable objects are electrophoresis
spots which all have similar but varying shape and color and present some striking
alignments. Actually, all of these perceptual remarks can be recovered in a fully
automatic way by combining several partial gestalt grouping laws.

Figure 18. Gestalt grouping principles at work for building an “order 3” gestalt (alignment of
blobs of the same size). First row: original DNA image (left) and its maximal meaningful
boundaries (right). Second row: left, barycenters of all meaningful regions whose area is inside
the only maximal meaningful mode of the histogram of areas; right, meaningful alignments of
these points.

First, the contrasted boundaries of this electrophoresis image are computed (above,
right). Notice that all closed curves found are indeed perceptually relevant as they surround
the conspicuous spots. Many other possible boundaries in the noisy background have been
ruled out and remain “masked in texture”. Let us apply a second layer of grouping laws.
This second layer will use as atomic objects the blobs found at the first step. For each of
the detected boundaries, we compute three qualities, namely

a) the area enclosed by the boundary, whose histogram is displayed on the bottom left of
figure 17. There is a unique maximal mode in this figure, which actually groups all and
exactly the blobs with similar areas and rules out two tiny blobs and a larger one
enclosing two different blobs. Thus, almost all blobs get grouped by this quality, with the
exception of two tiny spots and a double spot.

b) the orientation of each blob, an angle between -π/2 and π/2. This histogram (figure
17, bottom, middle) again shows a single maximal mode, again computed by the formula
of the third row of table 1. This mode appears at both end points of the interval, since
the dominant direction is ±π/2 and these values are identified modulo π. Thus, about the
same blobs as in a) get grouped by their common orientation.

c) the average grey level inside each blob: its histogram is shown on the bottom right
of figure 17. Again, most blobs, but not all, get grouped with respect to this quality.

A further structural grouping law can be applied to build subgroups of blobs formed by
alignment. This is illustrated in figure 18 (bottom, right), where the meaningful
alignments are found. This experiment illustrates the usual strong collaboration of partial
gestalts: most salient objects or groups come to sight by several grouping laws.



6. THE LIMITS OF EVERY PARTIAL GESTALT DETECTOR 

The preceding section argued in favor of a very simple principle, the Helmholtz principle,
applicable to the automatic and parameterless detection of any partial gestalt, in full
agreement with our perception. In this section, we shall show by briefly commenting on several
experiments that tout n'est pas rose (not everything is rosy): there is a good deal of visual
illusion in any apparently satisfactory result provided by a partial gestalt detector on a
digital image. We explained in the former section that partial gestalts often collaborate.
Thus, in many cases, what has been detected by a partial gestalt will be corroborated by
another one. For instance, boundaries and alignments in the experiment of figure 16 are in
good agreement. But what can be said about the experiment of figure 19? In this cheetah image,
we have applied the alignment detector explained above. It works wonderfully on the grass
leaves when they are straight. Now, we also see some unexpected alignments in the fur. These
alignments do exist: the detected lines are tangent to several of the convex dark spots on the
fur. This generates a meaningful excess of aligned points on these lines, the convex sets being
smooth enough and having therefore on their boundary a long enough segment tangent to the
detected line.




Figure 19. Smooth convex sets or alignments? 





GESTALT THEORY AND COMPUTER VISION 






We give an illustration of this fact in figure 20. The presence of only two circles can
actually create a (to be masked) alignment, since the circles have bitangent straight
lines. In order to discard such spurious alignments, the convexity (or good continuation)
gestalt should be systematically searched for when we look for alignments. Then,
the alignments which are only tangent lines to several smooth curves could be inhibited.

We should detect as alignment what indeed is aligned, but only under the condition that
the alignment does not derive from the presence of several smooth curves... This statement
can be generalized: no gestalt is just a positive quality. The outcome of a partial gestalt
detector is valid only when all other partial gestalts have been tested and the possible
conflicts dealt with.

The same argument applies to our next experimental example, in figure 21. In that
case, a dense cluster of points is present. Thus, it creates a meaningful number of dots
in many strips, and the result is the detection of obviously wrong alignments. Again,
the detection of a cluster should inhibit such alignment detections. We defined an
alignment as “many points in a thin strip”, but must add to this definition: “provided
these points do not build one or two dense clusters”.




Figure 20. Alignment is masked by good continuation and convexity: the small segments on the right are per- 
fectly aligned. Any alignment detector should find them. All the same, this alignment disappears on the left fig- 
ure, as we include the segments into circles. In the same way, the casual alignments in the Cheetah fur (figure 
19) are caused by the presence of many oval shapes. Such alignments are perceptually masked and should be 
computationally masked! 



One can raise the same issue with another gestalt conflict (figure 22). In this
figure, a detector of arcs of circles has been applied. The arc-of-circle grouping
law is easily adapted from the definition of alignments in table 1. The main outcome of the
experiment is this: since the MegaWave figure contains many smooth boundaries and several
straight lines, lots of meaningful circular arcs are found. It may be discussed whether
those circular arcs are present or not in the figure: clearly, any smooth curve is locally
tangent to some circle. In the same way, two segments forming an obtuse angle are tangent to
several circular arcs (see figure 23). Thus, here again, a partial gestalt should mask another
one. Hence the following statement, which is of wide consequence in Computer Vision:
We cannot hope for any reliable explanation of any figure by summing up the results of one or
several partial gestalts. Only a global synthesis, treating all conflicts of partial gestalts,
can give the correct result.







In view of these experimental counterexamples, it may well be asked why partial gestalt
detectors often work “so well”. This is due to the redundancy of gestalt qualities in most
natural images, as we explained in the first section with the example of a square. Indeed, most
natural or synthetic objects are simultaneously conspicuous, smooth and have straight or
convex parts, etc. Thus, in many cases, each partial gestalt detector will lead to the same
group definition. Our experiments on the electrophoresis image (figure 17) have illustrated
this collaboration of gestalts (see note 7). In that experiment, partial gestalts collaborate
and seem to be redundant.

This is an illusion which can be broken when partial gestalts do not collaborate.




Figure 21. One cluster, or several alignments?

Figure 22. Left: original “MegaWave” image. Right: a circular arc detector is applied to the
image. Now, this image contains many smooth curves and obtuse angles to which meaningful
circular arcs are tangent. This illustrates the necessity of the interaction of partial gestalts:
the best explanation for the observed structures is “good continuation” in the gestaltic sense,
i.e. the presence of a smooth curve, or of straight lines (alignments) forming obtuse angles.
Their presence entails the detection of arcs of circles which are not the final explanation.





Figure 23. Left: Every obtuse angle can be made to have many points in common with some long arc
of circle. Thus, an arc of circle detector will make wrong detections when obtuse angles are
present (see Figure 22). In the same way, a circle detector will detect the circle inscribed in
any square and conversely, a square detector will detect squares circumscribing any circle.



6.1. CONCLUSION

We shall close the discussion by expressing some hope, and giving some arguments in
favor of this hope. First of all, gestaltists pointed out the relatively small number of
relevant gestalt qualities for biological vision. We have briefly shown in this paper that
many of them (and probably all) can be computed by the Helmholtz principle followed
by a maximality argument. Second, the discussions of gestaltists about “conflicts of
gestalts”, so vividly explained in the books of Kanizsa, might well be solved by a few
information theoretical principles. As a good example, let us mention how the dilemma of
alignment versus parallelism can be solved by an easy minimal description length (MDL)
principle [26], [8]. Figure 24 shows the problem and its simple solution. In the middle, we
see all detected alignments in the Brown image on the left. Clearly, those alignments make
sense, but many of them are slanted. The main reason is this: all straight edges are in fact
blurry and therefore constitute a rectangular region where all points have roughly the same
direction. Thus, since alignment detection is made up to some precision, the straight
alignments are mixed up with slanted alignments which still respect the precision bound. We
can interpret the situation as a conflict between alignment and parallelism, as already
illustrated in figure 8.




Figure 24. Parallelism against alignment. Left: original Brown image. Middle: maximal meaningful
alignments. Here, since many parallel alignments are present, secondary, parasitic slanted
alignments are also found. Right: minimal description length of alignments, which eliminates the
spurious alignments. This last method outlines a solution to conflicts between partial gestalts.

The spurious, slanted alignments are easily removed by the application of an MDL principle:
it is enough to retain for each point only the most meaningful alignment to which
it belongs. We then compute again the remaining maximal meaningful alignments, and
the result (right) shows that the conflict between parallelism and alignment has been
solved. Clearly, information theoretical rules of this kind may be applied in a general
framework and put order in the proliferation of “partial gestalts”. Let us mention an
attempt of this kind in [20], where the author proposed an MDL reformulation of
segmentation variational methods ([24]).
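
The rule "retain for each point only the most meaningful alignment to which it belongs" can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: each candidate alignment is described by the number of points tested on its segment and by the set of identifiers of its aligned points, and after the reassignment each candidate is simply re-scored with the points it has kept, instead of recomputing maximal meaningful segments. All names and numerical values are hypothetical.

```python
from math import comb


def binomial_tail(p: float, n: int, k: int) -> float:
    """Tail of the binomial law: P(at least k successes among n trials)."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def exclusion_principle(candidates, p, n_tests, eps=1.0):
    """candidates: list of (length, aligned_point_ids) pairs.
    Returns (index, kept_points, new_nfa) for the candidates that are still
    meaningful once every point is given to its most meaningful alignment."""
    nfas = [
        n_tests * binomial_tail(p, length, len(points))
        for length, points in candidates
    ]

    # Step 1: each point keeps only the alignment with the smallest NFA.
    owner = {}
    for index, (_, points) in enumerate(candidates):
        for point in points:
            if point not in owner or nfas[index] < nfas[owner[point]]:
                owner[point] = index

    # Step 2: re-score every alignment with the points it still owns.
    kept = []
    for index, (length, points) in enumerate(candidates):
        remaining = {point for point in points if owner[point] == index}
        new_nfa = n_tests * binomial_tail(p, length, len(remaining))
        if new_nfa < eps:
            kept.append((index, remaining, new_nfa))
    return kept
```

In a toy configuration where a straight alignment of 20 points shares 10 of its points with a slanted candidate that has only 2 points of its own, the slanted candidate is meaningful before the reassignment and is discarded after it, while the straight alignment keeps all its points.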

NOTES

1 We shall write gestalt and treat it as an English word when we talk about gestalts as groups.
We maintain the German uppercase for Gestalt theory.

2 The good continuation principle has been extensively addressed in Computer Vision, first in
[23], more recently in [27] and still more recently in [11]. A recent example of a computer
vision paper implementing “good continuation”, understood as “constant curvature”, is [32].

3 Op. cit. 

4 Vedere e pensare op. cit., p 184 

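5 “non sono da considerare mascherati gli elementi troppo piccoli per raggiungere la soglia
della visibilità pur potendo essere rivelati con l'ausilio di una lente di ingrandimento, il che
dimostra che esistono come stimoli potenziali. E altrettanto vale per il caso inverso, nel quale
soltanto con la diminuzione dell'angolo visivo e la conseguente riduzione della grandezza degli
elementi e dei loro interspazi (mediante una lente o la visione a grande distanza) è possibile
vedere determinate strutture”. (In English: “elements too small to reach the threshold of
visibility are not to be considered masked, even though they can be revealed with the aid of a
magnifying glass, which shows that they exist as potential stimuli. The same holds for the
inverse case, in which only by diminishing the visual angle, and consequently reducing the size
of the elements and of their interspaces (by means of a lens or by viewing from a great
distance), is it possible to see certain structures”.)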

6 It is well known by gestaltists that a right angle “looks right” with some ±3 degrees
precision, and otherwise does not look right at all.

7 Krüger and Wörgötter [19] gave strong statistical computational evidence in favor of a
collaboration between partial gestalt laws, namely collinearity, parallelism, color, contrast
and similar motion.



TOWARDS AN ANALYTIC PHENOMENOLOGY:

THE CONCEPTS OF “BODILINESS” AND “GRABBINESS” 1



1. PHENOMENAL CONSCIOUSNESS 

In this paper, we present an account of phenomenal consciousness. Phenomenal con- 
sciousness is experience, and the problem of phenomenal consciousness is to explain 
how physical processes - behavioral, neural, computational - can produce experience. 
Numerous thinkers have argued that phenomenal consciousness cannot be explained 
in functional, neural or information-processing terms (e.g. Block 1990, 1994; 
Chalmers 1996). Different arguments have been put forward. For example, it has been 
argued that two individuals could be exactly alike in functional/computational/behav- 
ioral measures, but differ in the character of their experience. Though such persons 
would behave in the same way, they would differ in how things felt to them (for ex- 
ample, red things might give rise to the experience in one that green things give rise to 
in the other). Similarly, it has been held that two individuals could be 
functionally/computationally/behaviorally alike although one of them, but not the other,
is a mere zombie, that is, a robot-like creature who acts as if it has experience but
is in fact phenomenally unconscious. For any being, it has been suggested, the
question whether it has experience (is phenomenally conscious) cannot be answered 
by determining that it is an information-processor of this or that sort. The question is 
properly equivalent to the question whether there is anything it is like to be that indi- 
vidual (Nagel 1974). Attempts to explain consciousness in physical or information- 
processing terms sputter: we cannot get any explanatory purchase on experience when 
we try to explain it in terms of neural or computational processes. Once a particular 
process has been proposed as an explanation, we can then always reasonably wonder, 
it seems, what it is about that particular process that makes it give rise to experience. 
Physical and computational mechanisms, it seems, require some further ingredient if 
they are to explain experience. This explanatory shortfall has aptly been referred to as 
“the explanatory gap” (Levine 1983). 

We suggest that the explanatory gap is a product of a way of thinking about con- 
sciousness which sets up three obstacles to an explanation, that is, three reasons for hold- 
ing that the explanatory gap is unbridgeable. In this paper we propose ways of sur- 
mounting these obstacles, and in this way try to lay the foundations for a science of phe- 
nomenal consciousness. 

What is it exactly about phenomenal consciousness which makes it seem inaccessible
to normal scientific inquiry? What is so special about “feel”?



2. THE FIRST OBSTACLE: THE CONTINUOUSNESS OF EXPERIENCE

A first remarkable aspect about experience is that it seems ‘continuous’. Experiences 
seem to be “present” to us, and to have an “ongoing” or “occurring” quality which we 
might picturesquely describe as like the buzzing, whirring, or humming of a machine. 

Many scientists believe that to explain the ongoingness of experience we must uncover
some kind of neural process or activity that generates this ongoingness. But this is a
mistake (Dennett 1991; Pessoa, Thompson and Noë 1998). To see why, consider an analogy.
Most people would agree that there is something it is like to drive a car, and different cars 
have different “feels”. You have the Porsche driving feel when you know that if you press 
the accelerator, the car will whoosh forwards, whereas nothing comparable happens in oth- 
er cars. In a Porsche, if you just lightly touch the steering wheel, the car swerves around, 
whereas most other cars react more sluggishly. In general: the feel of driving a car, truck, 
tank, tractor or golf-cart corresponds to the specific way it behaves as you handle it. 

Now as you drive the Porsche, you are having the ongoing Porsche driving feel. But 
notice that as you drive you can momentarily close your eyes, take your hands off the 
steering wheel and your foot off the accelerator, yet you are still having the Porsche driv- 
ing feel even though you are getting virtually no Porsche-related sensory input. This is 
because the Porsche driving feel does not reside in any particular momentary sensory in- 
put, but rather in the fact that you are currently engaged in exercising the Porsche driv- 
ing skill. 

If the feel of Porsche driving is constituted by exercising a skill, perhaps the feel of red, 
the sound of a bell, the smell of a rose also correspond to skills being exercised. Taking 
this view about what feel is would have a tremendous advantage: we would have crossed 
the first hurdle over the explanatory gap, because now we no longer need a magical neu- 
ral mechanism to generate ongoing feel out of nerve activities. Feel is now not “gener- 
ated” by a neural mechanism at all, rather, it is exercising what the neural mechanism al- 
lows the organism to do. It is exercising a skill that the organism has mastery of. 

An analogy can be made with “life”: life is not something which is generated by some
special organ in biological systems. Life is a capacity that living systems possess. An
organism is alive when it has the potential to do certain things, like replicate, move,
metabolize, etc. But it need not be doing any of them right now, and still it is alive.

It may seem very peculiar to conceive of, say, the feel of red as a skill being exercised,
but we shall see the possibility of this position, as well as its advantages, in the next
sections. The idea and its implications have been developed in our previous papers (O'Regan
& Noë 2001a; O'Regan & Noë 2001b; O'Regan & Noë 2001c; Myin & O'Regan 2002;
cf. also Clark 2000; Grush 1998; Järvilehto 2001; Myin 2001; Pettit 2003a, b for similar
recent views).



A CONSEQUENCE OF THE “SKILL” IDEA: CHANGE BLINDNESS



When we look out upon the world, we have the impression of seeing a rich, continuously
present visual panorama spread out before us. Under the idea that seeing involves
exercising a skill, however, the richness and continuity of this sensation are not due to the
activation in our brains of a neural representation of the outside world. On the contrary,
the ongoingness and richness of the sensation derive from the knowledge we have of the
many different things we can do (but need not do) with our eyes, and the sensory effects
that result from doing them (O'Regan 1992). Having the impression of a whole scene
before us comes, not from every bit of the scene being present in our minds, but from
every bit of the scene being immediately available for “handling” by the slightest flick
of the eye.

But now a curious prediction can be made. Only part of the scene can be being “han- 
dled” at any one moment. The rest of the scene, although perceived as present, is actu- 
ally not being handled. If such currently un-handled scene areas were to be surrepti- 
tiously replaced, the change should go unnoticed. 

Under normal circumstances any change made in a scene will provoke an eye move- 
ment to the locus of the change. This is because there are hard-wired detectors in the vi- 
sual system that react to any sudden change in local luminance and cause attention to fo- 
cus on the change. (We will come back to this important property of the visual system 
under the heading of “grabbiness” in Section 3.) 

But by inserting a blank screen or “flicker” (Rensink, O’Regan & Clark 2000), or else 
an eye movement, a blink, “mudsplashes” (O’Regan, Rensink & Clark 1999), or a film 
cut between successive images in a sequence of images or movie sequence (for a review 
see Simons 2000), the sudden local luminance changes that would normally grab atten- 
tion and cause perceptual handling of a changing scene aspect are drowned out by the 
mass of other luminance changes occurring in the scene. There will no longer be a sin- 
gle place that the observers’ attention will be attracted to, and so we would expect that 
the likelihood of “handling” and therefore perceiving the location where the scene 
change occurs would be low. 
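
The masking argument above can be illustrated with a toy computation. The sketch below is only an assumption-laden illustration, not the procedure used in the cited experiments: the scene is a small grid of made-up luminance values, and `transients` is a hypothetical helper that lists every position whose luminance changes between two successive frames.

```python
# Toy illustration: luminance transients with and without an interposed blank.
# All names and values are hypothetical, chosen only for illustration.

def transients(prev, curr, threshold=10):
    """Return the set of (row, col) positions whose luminance changed by more than threshold."""
    return {
        (r, c)
        for r, row in enumerate(prev)
        for c, value in enumerate(row)
        if abs(curr[r][c] - value) > threshold
    }

# A 4x5 "scene" with luminances between 50 and 120.
original = [[50 + 10 * (r + c) for c in range(5)] for r in range(4)]

# The modified scene differs from the original at a single location.
modified = [row[:] for row in original]
modified[2][3] = 200

blank = [[0] * 5 for _ in range(4)]  # uniform blank frame

# Direct transition: the only transient is at the changed location.
direct = transients(original, modified)

# Transition out of a blank frame: every position produces a transient.
after_blank = transients(blank, modified)

print("direct change:", sorted(direct))                         # [(2, 3)]
print("after blank:", len(after_blank), "of 20 positions")      # 20 of 20
```

In the direct transition the transient singles out the changed location; after the blank, the same location is one transient among twenty, so nothing marks it as the place to look.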

And indeed that is what is found: surprisingly large changes, occupying areas as large 
as a fifth of the total picture area, can be missed. This is the phenomenon of “change 
blindness” (demonstrations can be found on http://nivea.psycho.univ-paris5.fr and 
http://viscog.beckman.uiuc.edu/change/). 



3. THE SECOND OBSTACLE: THE QUALITATIVE CHARACTER OF EXPERIENCE 

In the previous section we showed that by taking the view that experiences depend on 
the exercise of skills, we can forgo the search for neural processes that are, like the 
experiences themselves, ongoing. We no longer need to postulate a magical neural process 
that “generates” phenomenal consciousness, because, we claim, phenomenal conscious- 
ness is not generated: rather it is a skill people exercise. 

We now come to the second difficulty in explaining experience. 

Suppose you are thinking about your grandmother. You can cast your attention on the 
color of her eyes, the sound of her voice, the smell of her perfume. Nevertheless, thinking 
about your grandmother is nothing like actually seeing her: thinking has no perceptual 
phenomenal quality. Why is this? Why is there something it is like to have a perceptual 
experience (Nagel 1974)? This question forms the second obstacle that would 
seem to bar our path towards understanding phenomenal consciousness. 

The key, we propose, has to do with distinct properties of the kinds of skills that we 
exercise when we undergo conscious experience and that make these skills different from 
other skills (practical skills such as the ability to drive, cognitive skills, etc.). These 
aspects are bodiliness and grabbiness. 



BODILINESS 

If you really are looking at your grandmother and you turn your eyes, blink, or move 
your body, there will be an immediate and drastic change in the incoming sensory in- 
formation about your grandmother. On the other hand, nothing at all will happen if you 
are merely thinking about your grandmother. 

Bodiliness is the fact that when you move your body, incoming sensory information 
immediately changes. The slightest twitch of an eye muscle displaces the retinal image 
and produces a large change in the signal coming along the optic nerve. Blinking, mov- 
ing your head or body will also immediately affect the incoming signal. As concerns au- 
ditory information, turning your head immediately affects the phase and amplitude dif- 
ference between signals coming from the two ears, etc. 

Bodiliness is one aspect of sensory stimulation which makes it different from other 
forms of stimulation, and contributes to giving it its peculiar quality. Because of bodili- 
ness, sensory information has an “intimate” quality: it’s almost as though it were part of 
your own body. 



GRABBINESS 

Suppose that minor brain damage destroys your knowledge about your grandmother’s 
eyeglasses. Are you immediately aware that this has happened? No, the loss of the mem- 
ory of your grandmother’s glasses causes no whistle to blow in your mind to warn you. 
Only when you cast your mind upon the memory of your grandmother do you actually 
realize that you no longer know what her glasses were like. 

But consider what happens if instead of thinking about your grandmother, you are ac- 
tually looking at her. Even if you are not paying attention to her glasses in particular, if 
they should suddenly disappear, this would inevitably grab your attention: the sudden 
change would trigger local motion detectors in your low-level visual system, and an eye 
saccade would immediately be peremptorily programmed towards the location of the 
change. Your attentional resources would be mobilized and you would orient towards the 
change. This “grabbiness” of sensory stimulation, that is, its capacity to cause automatic 
orienting responses, is a second aspect which distinguishes it from other types of neural 
activity in the brain. Grabbiness is the fact that sensory stimulation can grab your attention 
away from what you were previously doing. 



TOWARDS AN ANALYTIC PHENOMENOLOGY 

Our claim is that bodiliness and grabbiness are jointly responsible for giving the par- 
ticular qualitative character to the exercise of sensorimotor skills which people have in 
mind when they talk of the “feel” of sensation or experience. Because of bodiliness, you 
are in a way “connected” to sensory stimulation: it changes with your minutest body mo- 
tion. Because of grabbiness, you somehow can’t get away from sensory stimulation: it 
has the capacity to monopolize your attention and keep you in contact with it. Bodiliness 
and grabbiness ensure that, unlike thoughts and memories, sensory stimulation has a 
“clinging” quality. Unlike thoughts and memories, experiences follow you around like a 
faithful dog. Furthermore, like the dog, they force themselves upon you by grabbing 
your attention whenever anything unexpected happens in the world. We suggest that 
bodiliness and grabbiness may be the reason why there is something it’s like to have a 
sensation. 

Note an important point about the concepts of bodiliness and grabbiness: they are phys- 
ically measurable quantities. A scientist should be able to come in and measure how 
much bodiliness and how much grabbiness there is in different types of sensory stimu- 
lation. The amount of bodiliness is determined by the degree to which sensory input de- 
pends on body motions. The amount of grabbiness is determined by the extent to which 
an organism’s orienting responses and processing resources are liable to be grabbed by 
the input. 
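
As a rough illustration of what such a measurement might look like, the sketch below scores bodiliness as the fraction of body movements that change the incoming signal. This operationalization, the one-dimensional "world", and the names `bodiliness`, `seeing` and `remembering` are all assumptions made for the example; the text does not commit to any particular measure, and grabbiness is not modeled here.

```python
# Hypothetical operationalization, for illustration only: bodiliness is scored as
# the fraction of body movements that change the incoming signal.

def bodiliness(signal, movements, start=0):
    """Fraction of movements after which signal(position) differs from the baseline."""
    baseline = signal(start)
    changed = sum(1 for move in movements if signal(start + move) != baseline)
    return changed / len(movements)

scene = [3, 7, 1, 8, 2, 9, 4]  # made-up luminances along a one-dimensional "world"

def seeing(eye_position):
    """Visual input: what reaches the 'retina' depends on where the eye points."""
    return scene[eye_position % len(scene)]

def remembering(eye_position):
    """A memory: its content does not depend on where the eye points."""
    return 5

eye_movements = [-2, -1, 1, 2]
print(bodiliness(seeing, eye_movements, start=3))       # 1.0
print(bodiliness(remembering, eye_movements, start=3))  # 0.0
```

Under these assumptions every eye movement alters the visual signal, while none alters the remembered content, which mirrors the contrast drawn between looking at your grandmother and merely thinking about her.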

If bodiliness and grabbiness are objectively measurable quantities, and if we are right 
in saying that they determine whether a sensory input has “feel”, then we should be able 
to predict how much “feel” different mental phenomena have. 

We have already seen that memory phenomena, like the memory of your grandmother, 
or thoughts or knowledge, have little or no bodiliness and no grabbiness. They have lit- 
tle feel, therefore. This seems to correspond with what people say about memory, 
thoughts and knowledge. 

We have also seen that experiences, like the experience of seeing the color of your 
grandmother’s eyes, have bodiliness and grabbiness, and should be perceived as pos- 
sessing “feel”. 

Now it is interesting to consider whether there exist intermediate cases. If we are right 
about the relation between bodiliness, grabbiness and feel, then cases of a little bit of 
bodiliness and grabbiness should correspond to a little bit of feel. 

Indeed a case in point is Porsche driving. In Porsche driving, some of your body move- 
ments produce immediate changes in sensory input - pressing the accelerator, touching 
the wheel, etc. But most of your body movements do not change sensory input related 
to the Porsche driving experience. Turning your head changes visual input, but the 
change is not specific to the Porsche driving feel - rather it constitutes the feel charac- 
teristic of vision. Sniffing your nose gives you the smell of leather, but that’s specific to 
the sense of smell. Those very particular sensorimotor contingencies which determine 
the feel of Porsche driving are restricted to a very particular set of behaviors which are 
specific to driving, namely those to do with how touching the wheel or pressing the accelerator 
affects what the car does. You can’t get the feel of a car by just waving your 
hands around in the air. You have to actually be exercising the car-driving skill. 

The situation is quite different from the feel of seeing red or hearing a bell, say, where 
almost any small body twitch or muscle movement in the perceptual system involved 
causes drastic sensory changes (high bodiliness). Moreover, if anything in your visual 
field suddenly turns red, or if suddenly a bell starts ringing near you, you will be imme- 
diately alerted (high grabbiness). 

We thus expect - and this corresponds well with what people say about the feel of driv- 
ing - that it makes sense to say that Porsche driving has a feel, but the feel is less inti- 
mate, less direct, less “present” than the sensation associated with seeing red or hearing 
a bell, because the latter have bodiliness and grabbiness to a much higher degree. 

Another interesting intermediate case is the feeling of being rich. What is being rich? 
It is knowing that if you go to your bank you can take out lots of money; it is knowing 
you can go on an expensive trip and that you needn’t worry about the price of dinner. 

Thus being rich has a certain degree of bodiliness, because there exist things you can 
do with your body which have predictable sensory consequences (e.g. you can make the 
appropriate maneuvers at the cash dispenser and the money comes out). But clearly, 
again, the link with body motions is not nearly as direct as in true sensory stimulation 
like seeing, when the slightest motion of virtually any body part creates immediate 
changes in sensory input. So being rich can hardly be said to have very much bodiliness. 

Similarly, being rich also has no grabbiness. If your bank makes a mistake and sud- 
denly transfers all your assets to charity, no alarm-bell rings in your mind to tell you. No 
internal mind-siren attracts your attention when the stock market suddenly goes bust: 
you only find out when you purposely check the news. 

Further interesting cases concern obsessive thoughts and experiences like worry and anxiety, 
as well as embarrassment, fear, love, happiness, sadness, loneliness and homesickness. 
These are more grabby than normal thinking, because you cannot help thinking 
about them. Some of these phenomena also have a degree of bodiliness, because there are 
things you can do to change them: for homesickness you can go home, for happiness you 
can remove the things that make you happy. Clearly there is “something it’s like” to experience 
these mental phenomena, but the quality they have is not of a sensory nature 2 . 

It is interesting to consider also the case of proprioception: this is the neural input that 
signals mechanical displacements of the muscles and joints. Motor commands which 
give rise to movements thus necessarily produce proprioceptive input, so proprioception 
has a high degree of bodiliness. On the other hand, proprioception has no grabbiness: 
body position changes do not peremptorily cause you to attend to them. Thus, as ex- 
pected from the classification we are putting forward, while we generally know where 
our limbs are, this position sense does not have a sensory nature. 

The vestibular system detects the position and motion of the head, and so vestibular in- 
puts have bodiliness. They also have some grabbiness, since sudden extreme changes in 
body orientation immediately result in re-adjusting reactions and grab your attention, 
sometimes provoking dizziness or nausea. In this sense then, the vestibular sense has a 
limited degree of sensory feel. 

The examples given here are simply a first attempt to use the notions of bodiliness and 
grabbiness to make a classification of phenomenal processes (but see also O’Regan and 
Noe 2001a). Further work is needed in this direction. It may also be worth considering 
whether there are other objective dimensions that could contribute to 
creating what could be called an “analytic phenomenology” based on objectively measurable 
quantities like bodiliness and grabbiness. In particular, to deal adequately with 
pain and emotions we may additionally need the concept of “automaticity”, which measures 
the degree to which a stimulation provokes an automatic behavior on the part of the 
organism. 



SUMMARY 

We have seen that, when added to the idea that feels correspond to having mastery of 
skills, the concepts of bodiliness and grabbiness allow the fundamental difference to be 
captured between mental phenomena that have no feel, like memory and knowledge, and 
mental phenomena that have feel, like sensations. Bodiliness and grabbiness furthermore 
allow us to understand why some intermediate situations, like driving or being rich can 
also be qualified as possessing a certain, but lesser, degree of “feel”. Bodiliness and 
grabbiness are objectively measurable quantities that determine the extent to which there 
is something it’s like to have a sensation. We suggest that bodiliness and grabbiness 
therefore allow us to pass the second obstacle to overcoming the explanatory gap. They 
explain why there is something it is like to feel. 



4. THIRD OBSTACLE: MODALITY AND SENSORY QUALITY 

To explain the nature of experience it is necessary to explain not only why there is 
something it is like to have an experience, but also why it is like this, rather 
than like that (Hurley and Noe 2003; Chalmers 1995). 

For example hearing involves a different quality as compared to seeing, which has a 
different quality as compared to tactile sensation. Furthermore, within a given sensory 
modality there are differences as well: for example, red has a different quality from 
green. This is the third major obstacle to an account of phenomenal consciousness. 

Explaining these differences in neural terms will not work: Neural activation is simply 
a way of coding information in the brain. As of now, we have no clue how differences 
in the code could ever give rise to differences in feel. 

But if we consider experiences as skills, then we can immediately see where their differences 
in phenomenal quality come from: they come from the nature of the different 
skills you exercise. Just as Porsche driving is a different skill from tractor driving, the 
difference between hearing and seeing amounts to the fact that, among other things, you 
are seeing if, when you blink, there is a large change in sensory input; you are hearing 
if nothing happens when you blink but there is a left/right difference when you turn 
your head; the amplitude of the incoming auditory signal varies in a certain lawful way 
when you approach a sound source, etc. We call these relations between possible actions 
and resultant sensory effects sensorimotor contingencies (O’Regan & Noe 2001b). 
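
A minimal sketch of this idea, under loudly stated assumptions: the two channel functions below are invented toy models (not physiological ones), and `probe` and `classify` are hypothetical helpers. The point is only that a modality can be told apart by how its input reacts to actions such as blinking and head turning, without looking at which channel carries the signal.

```python
import math

def visual_channel(body):
    """Toy visual input: zero while the eyes are closed, otherwise dependent on head angle."""
    return 0.0 if body["eyes_closed"] else 1.0 + math.cos(body["head_angle"])

def auditory_channel(body):
    """Toy left/right difference for a source straight ahead; unaffected by the eyelids."""
    return math.sin(body["head_angle"])

def probe(channel):
    """Record, for each action, whether it changes the input delivered by the channel."""
    rest = {"eyes_closed": False, "head_angle": 0.0}
    blink = dict(rest, eyes_closed=True)
    turn = dict(rest, head_angle=0.5)
    return {
        "blink": channel(blink) != channel(rest),
        "turn_head": channel(turn) != channel(rest),
    }

def classify(contingencies):
    """Apply the two contingencies mentioned in the text."""
    if contingencies["blink"]:
        return "seeing"
    if contingencies["turn_head"]:
        return "hearing"
    return "undetermined"

print(classify(probe(visual_channel)))    # seeing
print(classify(probe(auditory_channel)))  # hearing
```

Note that `classify` never inspects the channel itself, only the action-to-effect relations that `probe` records.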

SENSORY SUBSTITUTION 

From this follows a curious prediction. We claim that the quality of a sensory modali- 
ty does not derive from the particular sensory input channel or neural circuitry involved 
in that modality, but from the laws of sensorimotor contingency that are involved. It 
should therefore be possible to obtain a visual feel from auditory or tactile input, for ex- 
ample, provided the sensorimotor laws that are being obeyed are the laws of vision (and 
provided the brain has the computing resources to extract those laws). 

Such “sensory substitution” has been experimented with since Bach-y-Rita (1967) constructed 
the first device to allow blind people to see via tactile stimulation provided by 
a matrix of vibrators connected to a video camera. Today there is renewed interest in this 
field, and a number of new devices are being tested with the purpose of substituting different 
senses: visual-to-tactile (Sampaio, Maris, & Bach-y-Rita 2001); echolocation-to-auditory 
(Veraart, Cremieux, & Wanet-Defalque 1992); visual-to-auditory (e.g. Meijer 
1992; Arno, Capelle, Wanet-Defalque, Catalan-Ahumada, & Veraart 1999); auditory-to-tactile 
(cf. Richardson and Frost 1977 for review). Such devices are still in their infancy. 
In particular, no systematic effort has been undertaken up to now to analyze the 
laws of sensorimotor contingency that they provide. In our opinion it will be the similarity 
in the sensorimotor laws that such devices recreate which determines the degree to 
which users will really feel they are having sensations in the modality being substituted. 

Further support for the idea that the feel of a sensory modality is 
not wired into the neural hardware, but is rather a question of sensorimotor contingencies, 
comes from the amusing experiment of Botvinick & Cohen (1998), where the “feel” 
of being touched can be transferred from your own body to a rubber replica lying on the 
table in front of you (see also interesting work on the body image in tool use by Yamamoto 
& Kitazawa 2001; Iriki, Tanaka, & Iwamura 1996). The finding of Roe et al. 
(1990), according to which embryonically “rewired” ferrets can see with their auditory 
cortex, can also be interpreted within the context of our theory. 

INTRAMODAL SENSORY DIFFERENCES 

We have seen that the feel of different sensory modalities can be accounted for by the 
different things you do when you use these modalities. But what about the differences 
within a given sensory modality: can we use the same arguments? 

Within the tactile modality, this idea seems quite plausible. Consider the feel of a hard 
surface and the feel of a soft surface. Does this difference come from different kinds of 
tactile receptors being activated, or from the receptors being activated in different ways? 
No, we argue, since receptor activations are only codes that convey information - they 
are necessary for feel, but cannot by themselves generate the feel of hard and soft. On 
the contrary, we claim the difference between hard and soft comes from the different 
skills that you implicitly put to work when you touch hard and soft surfaces: the fact that 
when you push on a hard surface it resists your pressure; when you push on a soft surface, 
it gives way. The feels of hard and soft are constituted by the things you implicitly 
know about how the surface will react to your ongoing exploration. 

Now while this makes sense for tactile exploration, it might seem difficult to apply the 
same approach to other sensory modalities: what has the difference between red and 
green for example, got to do with sensorimotor contingencies? How can the feel of red 
consist in doing something, and the feel of green consist in doing something else? 

But consider what happens when you look at a red piece of paper. Depending on which 
way you turn the paper, it can reflect more of bluish sky light or more of yellowish sun- 
light from your window, or more of reddish lamplight from your desk. We suggest that 
one aspect of the feel of red is: knowing the laws that govern the changes in the light re- 
flected off the paper as you turn it (cf. Broackes 1992). 

Another aspect of the skill involved in the feel of red concerns retinal sampling. Reti- 
nal sampling of a centrally fixated red patch is done by a densely packed matrix of short, 
medium and long-wavelength sensitive cones. There is also a yellowish macular pigment 
which covers the central retina. When an eye movement brings the patch into peripher- 
al vision, the cone matrix that samples the patch is interspersed with rods, the distribu- 
tion is slightly different, and there is no macular pigment. The resultant change in qual- 
ity of the incoming sensory stimulation is another aspect of what it is like to be looking 
at a red patch. 



5. SUMMARY: HOW WE HAVE CROSSED THE GAP 

We have presented arguments showing how three obstacles to understanding experi- 
ence can be circumvented. 

The first obstacle was the fact that experiences appear to be ongoing, occurrent 
processes inside us. This has led scientists to seek brain mechanisms which are themselves 
also ongoing, and whose activity gives rise to feel. But we claim that any such 
quest is doomed, since the question will always ultimately remain of how activity of a 
physical system, no matter how complex or abstruse, can give rise to “feel”. 

Our solution is to show that feel is not directly generated by a brain mechanism, but 
consists in the active exercising of a skill, like driving or bicycle riding. The ongoing- 
ness of feel is not “produced” or “secreted” by brain activity, but resides in the active do- 
ing, the give-and-take that is involved in exercising a particular skill. 

The second barrier to explaining feel is the question of there being something it is like 
to have the experience, that is, of the experience having a qualitative character. We 
showed how the concepts of bodiliness and grabbiness allow the fundamental difference 
to be captured between mental phenomena that have no feel, like memory and knowl- 
edge, and mental phenomena that have feel, like experiences or sensations. Bodiliness 
and grabbiness are objectively measurable quantities that determine the extent to which 
there is something it’s like to have a sensation. Bodiliness and grabbiness allow us to 
pass the second obstacle to overcoming the explanatory gap. They explain why there is 
something it’s like to feel. 

The third obstacle preventing a scientific explanation of experience was that it was dif- 
ficult to understand how different types of neural activation could give rise to different 
types of experience, e.g. experiential differences within and between sensory modalities: 
after all, neural activations are just arbitrary codes for information, and information in it- 
self has no feel. 

A natural solution comes from the idea that differences in the feel of different sense 
modalities correspond to the different skills that are involved in exercising each modal- 
ity. This idea can also be made to work within a given sense modality, explaining the 
what-it-is-like of red versus green in terms of the different things you do when you are 
exploring red and green. 



HOW TO MAKE A ROBOT FEEL 

With these tools in hand, can we build a robot that feels? 

We provide the robot with mastery of the laws that govern the way its actions affect its 
sensory input. We wire up its sensory receptors so that they provide bodiliness and we 
ensure grabbiness by arranging things so that sudden sensory changes peremptorily mo- 
bilize the robot’s processing resources. Will the robot now have “feel”? 

No, one more thing is necessary: the robot must have access to the fact that it has mas- 
tery of the skills associated with its sensory exploration. That is, it must be able to make 
use of these sensory skills in its thoughts, planning, judgment and (if it talks) in its lan- 
guage behavior. 

Reasoning, thought, judgment and language are aspects of mind where AI and robotics 
have not yet reached human levels. But there is no a priori, logical argument that pre- 
vents this from being possible in the future. This is because there is no barrier in princi- 
ple that prevents reasoning, thought, judgment, and language from being described in 
functional terms. They are therefore in principle amenable to the scientific method and 
can theoretically be implemented by an information-processing device. Of course, be- 
cause human reasoning is intricately linked with human culture and social interaction, it 
may not be possible to satisfactorily replicate human reasoning without also replicating 
the social and developmental process through which each human goes. 

But once we manage to do this, a robot whose sensory systems possess bodiliness and 
grabbiness will feel. Indeed, it will feel for the same 
reasons that we do, namely because we have access to our mastery of sensory skills, and 
because of the bodiliness and grabbiness of sensory inputs. 



NOTES 

1 This paper offers a theoretical overview of ideas developed in a series of recent papers - O’Regan and 
Noe 2001a, b, c; Myin and O’Regan 2002; Noe and O’Regan 2000; 2002; Noe 2002; O’Regan 1992 - and also 
in work in progress by the authors. 

2 But note that the grabbiness involved in these phenomena is “mental” or “psychological” rather than sensory: 
it is not automatic orienting of sensory systems, but rather uncontrollable, obsessive mental orienting. 



INTERNAL REPRESENTATIONS OF SENSORY INPUT REFLECT THE MOTOR 
OUTPUT WITH WHICH ORGANISMS RESPOND TO THE INPUT 



1. INTRODUCTION 

What determines how sensory input is internally represented? The traditional answer is 
that internal representations of sensory input reflect the properties of the input. This answer 
is based on a passive or contemplative view of our knowledge of the world which is root- 
ed in the philosophical tradition and, in psychology, appears to be almost mandatory giv- 
en the fact that, in laboratory experiments, it is much easier for the researcher to control 
and manipulate the sensory input which is presented to the experimental subjects than the 
motor output with which the subjects respond to the input. However, a minority view 
which is gaining increasing support (Gibson, 1986; O’Regan and Noe, in press) is that in- 
ternal representations are instead action-based, that is, that the manner in which organisms 
internally represent the sensory input reflects the properties of the actions with which the 
organisms respond to the sensory input rather than the properties of the sensory input. 

In this chapter we describe a series of computer simulations using neural networks that 
tend to support the action-based view of internal representations. Internal representations 
in neural networks are not symbolic or semantic entities, like cognitivist representations 
(Fodor, 1981), but they are patterns of activation states in the network’s internal units 
which are caused by input activation patterns and which in turn cause activation patterns 
in the network’s output units. Our networks are sensory-motor neural networks. Their in- 
put units encode sensory input and their output units encode changes in the physical lo- 
cation of the organism’s body or body parts, i.e., movements. We train networks to exe- 
cute a number of sensory-motor tasks and by examining their internal representations at 
the end of training we determine whether these internal representations co-vary with the 
properties of the sensory input or with the properties of the motor output. 

The chapter describes three sets of simulations. In the first set we distinguish between 
micro-actions and macro-actions and we show that both micro-actions and macro-ac- 
tions are real for neural networks. Micro-actions are the successive movements that 
make up an entire goal-directed action, and each micro-action is encoded in the activa- 
tion pattern observed in the network’s motor output units in a single input/output cycle. 
Macro-actions are sequences of micro-actions that allow the organism to reach some 
goal. Our simulations show that internal representations encode, i.e., reflect the proper- 
ties of, macro-actions. In the second set of simulations we show that if there is a suc- 
cession of layers of internal units from the sensory input to the motor output the layers 
which are closer to the sensory input will tend to reflect the properties of the input and 
those closer to the motor output the properties of the output. However, in the third and 
final set of simulations we also show that the actions with which the organism responds 
to the input dictate the form of internal representations as low down the succession of internal 
layers and as close to the sensory input as is necessary for producing the appropriate 
actions in response to the input. 



2. MACRO-ACTIONS AND MICRO-ACTIONS 

The behavior of organisms can be described at various levels of integration. An organ- 
ism can be said to be “reaching for an object with its arm”, or one can describe the se- 
quence of micro-movements of the organism's arm that allow the organism to reach the 
object. The first type of description is in terms of macro-actions, the second one in terms 
of micro-actions (or micro-movements). Macro-actions are composed of sequences of 
micro-actions and typically one and the same macro-action is realized on different occasions 
by different sequences of micro-actions. The object can be in different spatial locations 
or the arm’s starting position can vary and, as a consequence, the arm’s trajectory 
will be different and will be composed of different sequences of micro-actions, although 
at the macro-action level the organism is in all cases “reaching for an object”. 

Behavior can be modeled using neural networks, which are computational models in- 
spired by the physical structure and way of functioning of the nervous system (Rumelhart 
& McClelland, 1986). Neural networks are sets of units (neurons) that influence each oth- 
er through their connections (synapses between neurons). Activation states propagate from 
input units to internal units to output units. The network’s behavior, i.e., the way in which 
the neural network responds to the input by generating some particular output, depends on 
the network’s connection weights. Neural networks are trained in such a way that the ini- 
tial random connection weights are progressively modified and at the end of training the 
neural network exhibits the desired behavior. 
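
The propagation of activation described above can be sketched as follows. This is a minimal illustration, not the networks used in the simulations: the layer sizes, the logistic activation function, the weight range and the absence of bias terms are all assumptions, and no training is shown, only the response of a network whose connection weights are still random.

```python
import math
import random

def logistic(x):
    """Squash a weighted sum into an activation level between 0 and 1 (assumed function)."""
    return 1.0 / (1.0 + math.exp(-x))

def random_weights(n_sending, n_receiving, rng):
    """One weight per connection, drawn at random as at the start of training."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_sending)] for _ in range(n_receiving)]

def propagate(activations, weights):
    """Activate each receiving unit from the weighted sum of the sending units."""
    return [logistic(sum(w * a for w, a in zip(row, activations))) for row in weights]

def respond(sensory_input, w_internal, w_output):
    """Propagate activation from input units to internal units to output units."""
    internal = propagate(sensory_input, w_internal)
    output = propagate(internal, w_output)
    return internal, output

rng = random.Random(0)
w_internal = random_weights(4, 3, rng)  # 4 input units -> 3 internal units (sizes assumed)
w_output = random_weights(3, 2, rng)    # 3 internal units -> 2 output units
internal, output = respond([1.0, 0.0, 0.0, 1.0], w_internal, w_output)
print("internal:", internal)
print("output:", output)
```

Changing any entry of `w_internal` or `w_output` changes how the same input is answered, which is the sense in which the network's behavior depends on its connection weights.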

One can train networks to exhibit the behavior of reaching for objects using a two-seg- 
ment arm. Some external event, e.g., the light reflected by an object, determines a par- 
ticular activation pattern in the network’s (visual) input units. The activation propagates 
through the network until it reaches the output units and determines a particular activa- 
tion pattern in the network’s output units which is then translated into a micro-movement 
of the arm. This is repeated for a succession of input/output cycles until the arm’s end- 
point (the hand) reaches the object. Once the behavior of reaching for objects has been 
acquired by the network, the behavior can be described at the macro- and at the micro- 
level. For example, one can either say “the network is reaching the object in the left por- 
tion of the visual space”, which corresponds to an entire sequence of output activation 
patterns, or one can describe each single output activation pattern which controls a sin- 
gle micro-action of the arm as the arm “is reaching the object on the left”. 
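
The cycle just described can be sketched in a few lines. The code below is an illustrative stand-in, not the simulation itself: a simple greedy rule (`stand_in_controller`, a hypothetical name) takes the place of the trained network, and the segment lengths, step size, tolerance and target position are assumed values. What it shows is the distinction at issue: each pass through the loop yields one micro-action, while the whole sequence constitutes the macro-action of reaching.

```python
import math

L1, L2 = 1.0, 1.0  # segment lengths (assumed)
STEP = 0.05        # size of one joint-angle change, in radians (assumed)

def hand_position(shoulder, elbow):
    """Endpoint of a planar two-segment arm with the shoulder at the origin."""
    x = L1 * math.cos(shoulder) + L2 * math.cos(shoulder + elbow)
    y = L1 * math.sin(shoulder) + L2 * math.sin(shoulder + elbow)
    return x, y

def distance_to(target, shoulder, elbow):
    x, y = hand_position(shoulder, elbow)
    return math.hypot(target[0] - x, target[1] - y)

# Each micro-action is a small change of the two joint angles.
MICRO_ACTIONS = [(ds, de) for ds in (-STEP, 0.0, STEP) for de in (-STEP, 0.0, STEP)
                 if (ds, de) != (0.0, 0.0)]

def stand_in_controller(target, shoulder, elbow):
    """Stand-in for the trained network: pick the micro-action that brings the hand closest."""
    return min(MICRO_ACTIONS,
               key=lambda m: distance_to(target, shoulder + m[0], elbow + m[1]))

def reach(target, shoulder, elbow, tolerance=0.1, max_cycles=500):
    """Run input/output cycles; return the final angles and the micro-actions performed."""
    micro_actions = []
    for _ in range(max_cycles):
        current = distance_to(target, shoulder, elbow)
        if current <= tolerance:
            break
        move = stand_in_controller(target, shoulder, elbow)
        if distance_to(target, shoulder + move[0], elbow + move[1]) >= current:
            break  # no micro-action helps; stop rather than oscillate
        shoulder, elbow = shoulder + move[0], elbow + move[1]
        micro_actions.append(move)
    return shoulder, elbow, micro_actions

target = (0.0, 1.5)
shoulder, elbow, sequence = reach(target, 0.0, 0.0)
print(len(sequence), "micro-actions; final distance",
      round(distance_to(target, shoulder, elbow), 3))
```

Calling `reach` with a different starting posture or a different target produces the same macro-level description, reaching for the object, from whatever sequence of micro-actions that situation requires.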

The first problem addressed in this chapter is whether the distinction between macro-ac- 
tions and micro-actions makes sense only for the researcher who is describing the behav- 
ior of the network (or of a real organism) or is also appropriate from the point of view of 
the network itself (or the organism). Micro-actions obviously are real for the network in 
that in each cycle one can “read” the activation pattern in the network’s output units which 
determines a micro-action. However, one can doubt that macro-actions are also real for the 
network in that it is not at all clear where one can find macro-actions in the network’s structure 
or organization. The network can be said to know “how to move the arm to reach the 
object in the left portion of the visual space” because this is what the network does but this 
knowledge seems to be purely implicit, not explicit. Explicitly, the network only knows 
how to produce the slight displacements of the arm which are encoded in the activation 
pattern of its output units in each input/output cycle. One could even say that while the mi- 
cro-level is sub-symbolic and quantitative in that it is expressed by the vector of activation 
states observed in each cycle in the network’s output units, the macro-level is symbolic and 
qualitative in that humans use language to generate descriptions of macro-actions. Neural 
networks are said to be subsymbolic and quantitative systems and in any case the particu- 
lar networks we are discussing do not have language. Hence, micro-actions seem to be re- 
al for them but macro-actions don’t. 

We will describe some simulations that attempt to show that both micro- and macro-actions
are real for neural networks and that neural networks have an explicit, and not only
an implicit, knowledge of macro-actions. We train the neural networks to reach for objects.
After the desired behavior has been acquired we examine the internal organization
of individual networks using two different methods: we measure the activation level of 
the network’s internal units in response to each possible input and we lesion the net- 
work’s internal units and connections and observe the changes in behavior that result 
from these lesions. From these analyses we conclude that both micro-actions and macro- 
actions are real for neural networks in that both micro-actions and macro-actions are ex- 
plicitly represented in the networks’ internal structure. 



2.1 SIMULATIONS 

An artificial organism lives in a bidimensional world which contains only two objects, 
object A and object B (Figure 1). 

Figure 1. The two objects, A and B



At any given time the organism sees either only one of the two objects, A or B, or both 
objects at the same time. When a single object is seen, the object may appear either in 
the left or in the right half of the organism’s visual field. When the organism sees the two 
objects together, object A can appear in the left visual field and object B in the right field, 
or vice versa. The possible contents of the organism’s total visual field at different times
are shown in Figure 2.

Figure 2. At any given time the organism sees one of these six visual scenes



The organism has a single two-segment arm with which the organism can reach the ob- 
jects by moving the arm in such a way that the arm’s endpoint (the hand) eventually ends 
up on an object. The objects are always located within reaching distance of the organism
(Figure 3). When a single object is presented the organism has to reach for it
regardless of whether the object is in the left or right field and regardless
of whether the object is A or B. When both object A and object B are presented the
organism always has to reach for object A, ignoring object B, regardless of
whether object A is in the left or in the right field. In summary, for the first three visual
patterns of Figure 2 the organism has to reach for the object on the left side of its visual 
field, whereas for the last three visual patterns it has to reach for the object on the right 
side of its visual field. 
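
The task rule just described can be written down compactly. The following is only an illustrative sketch, not the original simulation code: the function name `target_side`, the use of `None` for an empty half of the retina, and the ordering of the scenes in `SCENES` are assumptions made here for the example.

```python
from typing import Optional

def target_side(left: Optional[str], right: Optional[str]) -> str:
    """Return 'left' or 'right': the side of the visual field to reach toward.

    left / right are 'A', 'B', or None (empty half of the visual field).
    """
    if left is None and right is None:
        raise ValueError("at least one object must be visible")
    if left is not None and right is not None:
        # Two objects: always reach for A, wherever it is.
        if {left, right} != {"A", "B"}:
            raise ValueError("two-object scenes contain one A and one B")
        return "left" if left == "A" else "right"
    # Single object: reach for it regardless of its identity.
    return "left" if left is not None else "right"

# The six possible scenes as (left, right) pairs. The ordering is
# illustrative and need not match the layout of Figure 2.
SCENES = [("A", None), ("B", None), ("A", "B"),
          (None, "A"), (None, "B"), ("B", "A")]

for left, right in SCENES:
    print(left, right, "->", target_side(left, right))
```

Note that six distinct visual inputs collapse onto only two required responses, which is the many-to-one mapping that the later analyses of the internal units rely on.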




Figure 3. The organism with its total visual field and its two-segment arm. The organism is currently seeing an 
A object in the left field.

How should the organism's nervous system be internally organized so that the organism
is able to exhibit this kind of behavior? We will try to answer this question by simulating
the organism’s nervous system using a neural network and the acquisition of the
behavior we have described using a genetic algorithm for selecting the appropriate connection
weights for the neural network.

The neural network that controls the organism’s behavior has one layer of input (sensory)
units, one layer of output (motor) units, and one layer of internal units. The input layer
includes two distinct sets of units, one for the visual input and one for the proprioceptive
input, which tells the network the current position of the arm. The organism has
a ‘retina’ divided up into a left and a right portion. Each portion consists of a small
grid of 2x2=4 cells, so the whole retina is made up of 4+4=8 cells. Each cell of the retina
corresponds to one input unit; hence, there are 8 visual input units. These units can have
an activation value of either 1 or 0. An object is represented as a pattern of filled cells
appearing in the left or in the right portion of the retina (cf. Figure 1). The cells occupied by
the pattern determine an activation value of 1 in the corresponding input units and the empty
cells an activation value of 0. The proprioceptive input is encoded in two additional input
units. These units have a continuous activation value that can vary from 0 to 3.14,
corresponding to an angle measured in radians. The organism’s arm is made up of two segments,
a proximal segment and a distal segment (cf. Figure 3). One proprioceptive input
unit encodes the current value of the angle of the proximal segment with respect to the
shoulder. The other proprioceptive unit encodes the value of the angle of the distal segment
with respect to the proximal segment. In both cases the maximum value of the angle is 180
degrees. The current value of each angle is mapped into the interval between 0 (0° angle) and
3.14 (180° angle) and this number represents the activation value of the corresponding
proprioceptive unit. Since the visual scene does not change across a given number of successive
input/output cycles whereas the organism moves its arm during this period, the
visual input remains identical throughout but the proprioceptive input changes whenever
the arm moves.

The network’s output layer is made up of two units which encode the arm's movements, 
one unit for the proximal segment and the other unit for the distal segment. The activation 
value of each output unit varies continuously from 0 to 1 and is mapped into an angle which 
can vary from -10° to +10°. This angle is added to the current angle of each of the arm’s 
two segments resulting in a movement of the arm. However, if the unit’s activation value 
happens to be in the interval between 0.45 and 0.55, this value is mapped into a 0° angle, 
which means that the corresponding arm segment does not move. Hence, after moving the 
arm in response to the visual input for a while, the network can decide to completely stop 
the arm by generating activation values between 0.45 and 0.55 in both output units. 
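
The mapping from output activation to arm movement can be sketched as follows. This is a minimal illustration under stated assumptions: the text gives only the range (0 to 1 mapped to -10° to +10°) and the dead zone (0.45 to 0.55), so the linear form of the mapping, the inclusive boundaries of the dead zone, and the function names are choices made here for the example.

```python
def activation_to_delta_degrees(a: float) -> float:
    """Map an output activation in [0, 1] to an angle change in degrees."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("activation must lie in [0, 1]")
    if 0.45 <= a <= 0.55:
        # Dead zone: the corresponding arm segment does not move.
        return 0.0
    # Assumed linear mapping: 0 -> -10 degrees, 1 -> +10 degrees.
    return (a - 0.5) * 20.0

def arm_stopped(a_proximal: float, a_distal: float) -> bool:
    """The arm stops when both output units fall inside the dead zone."""
    return (activation_to_delta_degrees(a_proximal) == 0.0
            and activation_to_delta_degrees(a_distal) == 0.0)

print(activation_to_delta_degrees(0.0))   # most negative step
print(activation_to_delta_degrees(0.5))   # inside the dead zone
print(arm_stopped(0.47, 0.53))
```

The dead zone is what allows a network with continuous outputs to produce a discrete decision to stop.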

The 8 visual input units project to a layer of 4 internal units which in turn are connected 
with the 2 motor output units. Therefore the visual input is transformed at the level of the 
internal units before it has a chance to influence the motor output. In contrast, the proprioceptive
input directly influences the motor output. The two input units that encode the
current position of the arm are directly connected with the two output units which determine 
the arm's movements. The entire neural architecture is schematized in Figure 4.

Figure 4. Neural network architecture of the organism (diagram labels: visual input, proprioceptive input, motor output)
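
A forward pass through an architecture of this shape might look like the sketch below. Several details are assumptions not given in the text: the logistic (sigmoid) activation function, the absence of bias terms, the initial weight range of -1 to +1, and all identifier names. Only the layer sizes and the connectivity (8 visual units to 4 internal units to 2 output units, plus 2 proprioceptive units wired directly to the 2 output units) come from the description above.

```python
import math
import random

N_VISUAL, N_PROPRIO, N_HIDDEN, N_OUT = 8, 2, 4, 2

def sigmoid(x: float) -> float:
    """Assumed activation function; it keeps activations between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def random_weights(rng: random.Random) -> dict:
    """Random weights in an assumed range [-1, 1]; biases are omitted."""
    def matrix(rows, cols):
        return [[rng.uniform(-1.0, 1.0) for _ in range(cols)]
                for _ in range(rows)]
    return {
        "visual_to_hidden": matrix(N_HIDDEN, N_VISUAL),
        "hidden_to_out": matrix(N_OUT, N_HIDDEN),
        "proprio_to_out": matrix(N_OUT, N_PROPRIO),
    }

def forward(w: dict, visual: list, proprio: list):
    """Return (internal activations, output activations) for one cycle."""
    # The visual input is transformed by the internal layer first.
    hidden = [sigmoid(sum(wi * v for wi, v in zip(row, visual)))
              for row in w["visual_to_hidden"]]
    out = []
    for j in range(N_OUT):
        net = sum(wh * h for wh, h in zip(w["hidden_to_out"][j], hidden))
        # The proprioceptive input bypasses the internal layer.
        net += sum(wp * p for wp, p in zip(w["proprio_to_out"][j], proprio))
        out.append(sigmoid(net))
    return hidden, out

weights = random_weights(random.Random(0))
hidden, out = forward(weights, [1, 1, 0, 0, 0, 0, 0, 0], [1.57, 0.5])
print(hidden, out)
```

With this connectivity there are 8x4 + 4x2 + 2x2 = 44 connection weights. Because the internal activations depend only on the visual input, they stay constant within an epoch while the outputs vary with the arm position.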



How a neural network responds to the input depends on the network’s connection 
weights. To find the connection weights that result in the behavior of reaching for an A 
or B object when presented alone and reaching for the A object when presented togeth- 
er with the B object, we have used a genetic algorithm, a computational procedure that 
mimics evolutionary change (Holland, 1975). An initial population of 100 neural net- 
works is created by assigning randomly selected connection weights to the neural net- 
works. Each individual network lives a life of a maximum of 600 time steps (input/out- 
put cycles) divided up into 10 epochs of 60 time steps each. (An epoch is terminated 
when an object is reached.) During each epoch one of the six possible visual inputs of 
Figure 2, randomly chosen, is presented to the individual and this visual input remains 
the same during the entire epoch. However, since the organism can move its arm to reach 
the object, the proprioceptive input can vary during an epoch. Moreover, since at the be- 
ginning of an epoch the arm is positioned in a randomly selected starting position, the 
initial proprioceptive input varies in each epoch. 

At the end of life each individual is assigned a total fitness value which is the average 
of the fitness values obtained by the individual in each of the 10 epochs. An epoch’s fit- 
ness value is +1 if the correct object has been reached and is -1 if the incorrect object has 
been reached, i.e., if the organism reaches the B object with the A object also present. An 
object is considered as reached if, when the arm stops, the arm’s endpoint happens to be 
located within 10 pixels from the object’s center, i.e., from the point in which the two lit- 
tle squares that make up the object touch each other. Furthermore, the fitness value of 
each epoch is reduced by a small amount which increases with the squared distance between
the point in which the arm stops and the object’s center. In other words, an individual
is rewarded for stopping its arm as close as possible to the object’s center.

The fitness formula is the following:

$$
\text{Fitness} = \frac{(\text{nCorrect} - \text{nIncorrect}) - k \sum_{i=1}^{10} \text{Distance}_i^{2}}{10}
$$

where k = 0.001, nCorrect = number of objects correctly reached, nIncorrect = number of
objects incorrectly reached, Distance_i = distance between target and final hand position
for epoch i, and 10 = number of epochs. If in a particular epoch the distance is greater than
100 pixels or the arm does not stop, the distance is considered as equal to 100 pixels.
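
The fitness computation can be sketched in a few lines. This is an illustration of the formula as stated, not the original code; the representation of an epoch as an `(outcome, distance)` pair and the names used are assumptions introduced here.

```python
K = 0.001
N_EPOCHS = 10
MAX_DISTANCE = 100.0  # pixels; also used when the arm never stops

def lifetime_fitness(epochs):
    """Compute the total fitness of one individual.

    epochs: list of (outcome, distance) pairs, one per epoch, where
    outcome is 'correct', 'incorrect', or 'none', and distance is the
    final hand-to-target distance in pixels (None if the arm did not stop).
    """
    if len(epochs) != N_EPOCHS:
        raise ValueError("expected one entry per epoch")
    n_correct = sum(1 for outcome, _ in epochs if outcome == "correct")
    n_incorrect = sum(1 for outcome, _ in epochs if outcome == "incorrect")
    penalty = 0.0
    for _, distance in epochs:
        # Cap the distance at 100 pixels, as specified in the text.
        if distance is None or distance > MAX_DISTANCE:
            distance = MAX_DISTANCE
        penalty += K * distance ** 2
    return (n_correct - n_incorrect - penalty) / N_EPOCHS

# An individual that reaches the correct object exactly on its center
# in every epoch obtains the maximum fitness.
print(lifetime_fitness([("correct", 0.0)] * 10))
```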

The 20 networks with the highest fitness are selected for reproduction. The weight val- 
ues of all the connections of an individual neural network are encoded in the network’s 
genotype. A network which is selected for reproduction generates 5 copies of its genotype 
and each copy is assigned to one of 5 new networks (offspring). Each copy is slightly 
modified by adding a quantity randomly selected in the interval between +1 and -1 to the 
current value of 10% (on average) of the weights (genetic mutations). The 20x5=100 new 
networks constitute the next generation. The process is repeated for 10,000 generations. 
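
One generation of this selection-and-mutation scheme can be sketched as follows. The sketch assumes a genotype is a flat list of connection weights and that each weight is mutated independently with probability 0.1, which is one way to obtain "10% on average"; `fitness_fn` is a hypothetical stand-in for evaluating an individual over its 10 epochs.

```python
import random

POP_SIZE, N_PARENTS, N_OFFSPRING = 100, 20, 5
MUTATION_RATE, MUTATION_RANGE = 0.10, 1.0

def mutate(genotype, rng):
    """Perturb about 10% of the weights by a uniform value in [-1, +1]."""
    return [w + rng.uniform(-MUTATION_RANGE, MUTATION_RANGE)
            if rng.random() < MUTATION_RATE else w
            for w in genotype]

def next_generation(population, fitness_fn, rng):
    """Select the 20 fittest genotypes; each produces 5 mutated copies."""
    ranked = sorted(population, key=fitness_fn, reverse=True)
    parents = ranked[:N_PARENTS]
    # Reproduction is nonsexual: each offspring derives from one parent.
    return [mutate(parent, rng)
            for parent in parents
            for _ in range(N_OFFSPRING)]

rng = random.Random(1)
population = [[rng.uniform(-1.0, 1.0) for _ in range(44)]
              for _ in range(POP_SIZE)]
# Placeholder fitness for demonstration only (sum of weights).
population = next_generation(population, lambda g: sum(g), rng)
print(len(population))
```

In the actual simulations this step would be repeated for 10,000 generations with the reaching-task fitness in place of the placeholder.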

In the early generations the behavior of the organism is not very good but the selective 
reproduction of the best individuals and the constant addition of new variants by the ge- 
netic mutations (reproduction is nonsexual) result in a progressive increase in the aver- 
age fitness of the population so that after a certain number of generations most individ- 
uals in the population exhibit the behavior we have described (Figure 5): when an indi- 
vidual sees a single object, it reaches the object whether it is an A or B object and 
whether the object appears in its left or right field; when the organism perceives two ob- 
jects at the same time, it reaches the A object and ignores the B object both when the A 
object is in the left field and when it is in the right field. 




Figure 5. The organism correctly reaches the B object presented alone in the left visual field (left) and the A 
object presented in the right visual field together with the B object in the left visual field (right)

Figure 6 shows the increase in fitness across 10,000 generations for the single best individual
and for all individuals in each generation. The results are the average of 10
replications of the simulations starting with randomly selected initial conditions (different
“seeds” for the initial assignment of connection weights, for the initial starting position
of the arm in each trial, etc.).




Figure 6. Increase in fitness across 10,000 generations for the single best individual and for all the individuals 
in each generation (average). The maximum fitness value is 1, which means that in all epochs the correct ob- 
ject has been reached and the arm’s endpoint stops exactly on the center of the object. 



The robustness of the behavior which has been acquired is demonstrated by a general- 
ization test in which an individual is exposed to all 6 possible visual inputs and to 5 ran- 
domly chosen initial positions of the arm for each of the 6 visual inputs, for a total of 30 
different inputs. The results of this test, which has been conducted on all the individuals
of the last generation for all 10 replications of the simulation, show that the 20 best individuals,
that is, those individuals which are selected for reproduction, correctly reach
the target object almost always.



2.2 ANALYSIS OF THE INTERNAL ORGANIZATION OF THE NEURAL NETWORKS 

To determine how the neural networks of our organisms are internally organized as a 
result of evolution in order to exhibit the behavior which has been described, we meas- 
ure the activation level of each of the 4 units of the internal layer of the organisms’ neu- 
ral networks in response to each of the six visual stimuli. This has been done for 10 in- 
dividuals, i.e., the best individuals of the last generation in each of the 10 replications of 
the simulation. The results are shown in Figure 7.

Figure 7. Activation level of the 4 internal units (columns) in response to the six visual stimuli (rows) for the
single best individual of the last generation in each of the 10 replications of the simulation. Although the activation
level is continuous and is mapped into a grey scale (0 = black and 1 = white), observed activation levels
are quite extreme.



The results of this analysis show that in most individuals there are three types of inter- 
nal units. The first type of internal units (black columns in Figure 7) exhibit an activa- 
tion level of near zero in response to all possible visual inputs. In other words, these units 
play no role in determining the neural network’s output, i.e., the arm movements. They 
appear to be useless, at least at the end of evolution. Notice however that these units may 
play a role during the course of evolution even if they play no role at the end. In fact, if 
we reduce the number of internal units the same terminal level of performance is even- 
tually reached but it is reached more slowly. In other words, ‘starting big’, i.e., with more 
computational resources, may help even if at the end of evolution some of the resources 
are not used (Miglino and Walker, 2002). 

The second type of internal units are units which are invariably highly activated in re- 
sponse to all possible visual inputs (white columns). The activation level of these units 
also does not co-vary with the input, exactly like the zero activation units we have al- 
ready described, but these units do have a role in determining the observed behavior. By 
being constantly activated with an activation level of almost 1 they influence the motor 
output of the network through the particular weight values of the connections linking 
them to the output units. 

The third and final type of internal units (black and white columns) are those units that 
have an activation level which is not always the same but varies as a function of the vi- 
sual input. However, the activation level of these units cannot be said to vary with the 
visual input in the sense that each of the 6 different visual inputs elicits a different acti- 
vation level in these units. On the contrary, these units tend to exhibit one of only two 
different activation levels (and two rather extreme activation levels since one is near zero
and the other one near 1) while there are 6 different visual inputs and, furthermore,
these two activation levels are correlated with the network’s motor output rather than with
the network’s visual input. If we examine the organism’s behavior we see that these in- 
ternal units have one particular activation level, say 1, when the network is responding 
to the visual input by moving the arm towards the left visual space, and they have a very 
different activation level, i.e. 0, when the movement of the arm is toward the right visu- 
al space. Notice that our organisms ‘move their arm to the left’ in response to different 
visual inputs, i.e., both when there is a single object, A or B, in the left field and when 
the A object is in the left field and the B object is in the right field. Similarly, they ‘move 
their arm to the right’ both when there is a single object, A or B, in the right field and 
when the A object is in the right field and the B object in the left field. Hence, this third 
type of internal unit tends to reflect the motor output of the network rather than the sensory
input. More precisely, these units reflect (encode) the macro-actions with which the organism
responds to the different sensory inputs.
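
The three-way classification of internal units described in this section can be expressed as a simple procedure. The sketch below is illustrative only: the tolerance `tol` used to decide what counts as "near 0" or "near 1", the category labels, and the sample activation values are assumptions, not values reported in the text.

```python
def classify_unit(activations, macro_actions, tol=0.1):
    """Classify one internal unit from its responses to the visual scenes.

    activations: one activation in [0, 1] per visual scene.
    macro_actions: 'left' or 'right' per scene, in the same order.
    tol: hypothetical threshold for 'near 0' and 'near 1'.
    """
    if all(a <= tol for a in activations):
        return "always off"
    if all(a >= 1.0 - tol for a in activations):
        return "always on"
    left = [a for a, m in zip(activations, macro_actions) if m == "left"]
    right = [a for a, m in zip(activations, macro_actions) if m == "right"]
    left_on = all(a >= 1.0 - tol for a in left)
    left_off = all(a <= tol for a in left)
    right_on = all(a >= 1.0 - tol for a in right)
    right_off = all(a <= tol for a in right)
    # Selective units take one extreme value for all 'left' scenes and
    # the opposite extreme for all 'right' scenes.
    if (left_on and right_off) or (left_off and right_on):
        return "macro-action selective"
    return "other"

actions = ["left"] * 3 + ["right"] * 3
print(classify_unit([0.02] * 6, actions))
print(classify_unit([0.98] * 6, actions))
print(classify_unit([0.97, 0.99, 0.95, 0.03, 0.01, 0.04], actions))
```

A unit that encoded the visual input as such would be expected to distinguish more than two of the six scenes and would fall into the "other" category here.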



2.3 LESIONING THE NEURAL NETWORKS 

Another type of analysis that may reveal the internal organization of our neural net- 
works consists in lesioning individual internal units and observing the type of disturbed 
behavior that emerges as a consequence of these lesions. When an internal unit is le- 
sioned all the connections departing from the unit are cut and therefore the lesioned unit 
ceases to have any role in determining the activation level of the output units and the or- 
ganism’s behavior. We have lesioned one internal unit at a time of the neural networks 
of the same individuals already examined in the previous section, i.e., the 10 individuals 
which are the best ones of the last generation in each of the 10 replications of the simu- 
lation. 
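
Lesioning an internal unit in this sense amounts to removing its outgoing connections. A minimal sketch, assuming the weights from the internal layer to the output layer are stored as a list of rows with `hidden_to_out[j][i]` holding the weight from internal unit `i` to output unit `j` (a layout chosen here for illustration):

```python
import copy

def lesion_internal_unit(hidden_to_out, unit):
    """Return a copy of the weights with one internal unit disconnected.

    Setting the unit's outgoing weights to zero has the same effect as
    cutting its connections: it no longer influences the output units.
    """
    lesioned = copy.deepcopy(hidden_to_out)
    for row in lesioned:
        row[unit] = 0.0
    return lesioned

weights = [[0.5, -1.0, 2.0, 0.1],
           [1.5, 0.3, -0.7, 0.9]]
print(lesion_internal_unit(weights, 2))
```

Under this formulation, a unit whose activation is always near 0 already contributes almost nothing to the output, whereas a unit whose activation is always near 1 contributes a constant term to every output unit, so removing it shifts the output for every visual input.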

If we lesion the internal units of the first type, i.e., those units which have a constant 
activation level of 0, there are no consequences for the organism’s behavior, both when 
the object to be reached is in the left portion of the visual field and when it is in the right 
portion. This is not surprising since these units play no role in determining the organ- 
isms’ response to the input and therefore lesioning these units has no damaging effects 
on the organisms’ behavior. 

If we lesion the second type of internal units, those with a constant activation level of 
near 1, the negative consequences for the organism’s behavior are always very serious 
and equally distributed across all types of visual inputs and behavioral responses. The 
organism appears to be completely unable to reach the objects whatever the position, 
type, and number of objects in its visual field (Figure 8). More specifically, the arm fails 
to reach the portion of the total space which is visually perceived by the organism (vi- 
sual space) and in which the objects are found. In other words, when these units are le- 
sioned, the organism appears to be unable to execute the macro-action which consists in 
“reaching the visual space”.

Figure 8. Lesioning an internal unit with a constant activation level of 1 completely disrupts the ability (macro-action)
to reach the portion of the space which is visually perceived. The figure shows for a particular individual
where the arm stops in response to two sample visual inputs.



A very different type of behavioral damage appears if we lesion the third, selective, type 
of internal units, i.e., those whose activation level co-varies with the two macro-actions 
“move the arm toward the left field” and “move the arm toward the right field”, respectively.
In 9 out of the 10 individuals there is only one unit of this type. Lesioning this unit leads
to a form of stereotyped behavior: depending on the individual, and whatever the visual input,
the organism either always moves its arm to the left portion of the visual space or always
moves it to the right portion. Hence, in half the epochs the organism’s arm
reaches the correct object and in the remaining epochs it reaches either the wrong object
(when two objects are presented) or the wrong portion of the visual space (when only one object
is presented) (Figure 9). Success in the former epochs appears to be a fortuitous result of the
particular position in which the object happens to be located, that is, of whether the correct
object happens to lie in the portion of the visual space always reached by the stereotyped
behavior of the organism.




Figure 9. Behavior after lesioning an internal unit whose activation level varies with the visual input in the 9
out of 10 individuals in which there is only one unit of this type. The figure shows where the arm stops in response
to two sample visual inputs for an individual in which the internal unit has an activation level of 0 encoding
the macro-action “reaching toward the left field” and an activation level of 1 encoding the macro-action
“reaching toward the right field”.

From the results of these analyses we conclude that the internal units of our networks
encode macro-actions. The activation level of the internal units co-varies with the
macro-actions of the organism and lesioning these internal units leads to disruption of
entire macro-actions.

That the internal layer, which receives input from the retina and therefore constructs
internal representations of the visual input, encodes macro-actions can also be
shown by contrasting the effects of lesions to the units comprising this internal layer with
the effects of lesions to the proprioceptive-to-motor pathway. While the visual-to-motor 
pathway encodes macro-actions, the proprioceptive-to-motor pathway encodes micro- 
actions. In other words, the internal layer receiving visual information from the retina 
tells the network what to do at the macro level, for example “move the arm toward the 
left portion of the visual space”, while the connection weights from the proprioceptive 
input units to the motor output units tell the network what to do at the micro level, that 
is, how to actually implement the macro-action “move the arm toward the left portion of 
the visual space” given the current and constantly changing position of the arm. 

As we have seen, lesions to the visual-to-motor pathway disrupt entire macro-actions. 
What kind of damage results from lesioning the proprioceptive-to-motor pathway? Since 
there are no internal units in the proprioceptive-to-motor pathway but the proprioceptive 
input units directly project to the motor output units, we have lesioned this pathway by 
introducing some random noise in it. We have added a quantity randomly selected in the
interval between -0.2 and +0.2 to the current value of each of the four connection 
weights linking the 2 proprioceptive input units to the 2 motor output units. The result 
of this operation is that the behavior of the 10 individuals appears seriously damaged 
(the percentage of correct responses is 0% for both portions of the visual space) but the 
behavioral deficit is very different from the deficit observed with lesions to the visual- 
to-motor pathway. If, for example, the visual input requires reaching for the object in the 
left visual field, an individual with lesions in the proprioceptive-to-motor pathway is still 
able to move its arm toward the left visual field (which implies that the macro-action is 
preserved) but the arm stops when the arm’s endpoint is still somewhat distant from the 
object. Hence, the object is not actually reached (Figure 10). 
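
The noise lesion applied to the proprioceptive-to-motor pathway can be sketched as below. The 2x2 weight layout, the function name, and the sample weight values are assumptions made for the example; only the noise interval (-0.2 to +0.2) and the number of affected weights (four) come from the text.

```python
import random

def lesion_proprio_pathway(proprio_to_out, rng, amplitude=0.2):
    """Return a copy of the four proprioceptive-to-motor weights with
    uniform noise in [-amplitude, +amplitude] added to each of them."""
    return [[w + rng.uniform(-amplitude, amplitude) for w in row]
            for row in proprio_to_out]

rng = random.Random(3)
original = [[0.4, -0.2],
            [1.0, 0.0]]   # hypothetical weight values
print(lesion_proprio_pathway(original, rng))
```

Unlike the unit lesions, this perturbation leaves the visual-to-motor pathway and therefore the internal units' activations untouched, which is consistent with the macro-action being preserved while its fine-grained execution degrades.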





Figure 10. The figure shows where the arm stops after lesioning the proprioceptive-to-motor pathway. The individual’s
macro-actions are still intact but their precise realization is disrupted.

This systematic behavioral deficit which is observed after lesioning the proprioceptive-to-motor
pathway shows that the organisms still know what macro-action
should be produced (the one encoded in the intact visual-to-motor pathway) but are
unable to correctly realize this macro-action at the micro level of the specific changes in
successive arm positions because the proprioceptive information that specifies the current
arm position is disturbed.

We have measured the average distance from the target in normal, i.e., not lesioned, or- 
ganisms, in organisms with lesions to the proprioceptive-to-motor pathway, and in or- 
ganisms with lesions to the internal units that are always activated. Both after lesions to 
the proprioceptive-to-motor pathway and to the always activated units the performance 
in terms of correct responses is completely disrupted, i.e., objects are never reached, but 
whereas in the latter case the arm’s endpoint stops very far from the target, in the former 
case the arm’s endpoint stops relatively close to the target (Figure 11). 




Figure 11. Average distance of the arm’s endpoint from the target in normal individuals, in individuals with le- 
sions to the proprioceptive-to-motor pathway, and in individuals with lesions to the always activated units. 
While after lesions to the latter type of units the individuals are completely unable to reach the portion of the 
visual space where the objects are found, after lesions to the proprioceptive-to-motor pathway they are still able
to move their arm toward the correct portion of the visual space, but the arm’s endpoint stops when it is more 
or less removed from the object. 



3. FROM VISUAL INPUT TO MOTOR OUTPUT 

The activation patterns observed in the internal units of a neural network when some 
input arrives at the network’s input units can be called the network’s internal representations.
However, although it is the sensory input that causes these internal representations,
the simulations we have described demonstrate that a neural network’s internal
representations tend to reflect the properties of the motor output (at the level of the 
macro-actions) with which the network responds to the sensory input rather than those 
of the sensory input itself. The activation patterns observed in the internal units of our 
organisms co-vary with the macro-action with which the organism responds to a vari- 
ety of sensory inputs rather than with these different sensory inputs. In other words, 
networks have action-based internal representations rather than sensory-based repre- 
sentations. 

This appears to be a logical consequence of the intrinsic nature of neural networks. In 
very general terms neural networks can be viewed as systems for transforming activation 
patterns into other activation patterns. An external cause produces an activation pattern in 
the network’s input units and then the network’s connection weights transform this activa- 
tion pattern into a succession of other activation patterns in the network’s successive lay- 
ers of internal units until the output activation pattern is generated. If we assume that the 
input pattern encodes the characteristics of some visual stimulus and the output those of 
some motor response, the successive activation patterns will progressively reflect less and 
less the characteristics of the visual input and more and more the characteristics of the mo- 
tor output, and the network’s internal representations will become more and more action- 
based. In this Section we examine how internal representations become progressively less 
sensory-based and more action-based. 

In the simulations described in the previous Section there was a single layer of internal units
and we have seen that this layer of internal units encodes macro-actions rather than vi- 
sual information. We might be able to observe a more gradual mapping of visual into 
motor information if we provide our neural networks with a succession of layers of in- 
ternal units rather than a single internal layer. For example, with two layers of internal 
units we should be able to observe that the first layer, i.e., the layer which is closer to the 
input layer, reflects more closely the characteristics of the visual input, whereas the sec- 
ond layer, i.e., the layer which is closer to the output layer, will reflect the characteris- 
tics of the motor output. 



3.1 SIMULATIONS 

We have run two new simulations using the same task as before but two new network 
architectures both with two successive layers of internal units, not only one as in the pre- 
ceding simulation. Both internal layers contain 4 units but in one simulation the entire 
retina projects to all the units of the lower internal layer (Figure 12a) whereas in the oth- 
er simulation the lower layer is divided up into two separate sets of 2 units, and the left 
half of the retina projects to the first 2 units and the right half to the other 2 units (Figure 
12b). All the 2+2=4 units of the lower internal layer send their connections to all the 4 
units of the higher internal layer, which are connected to the output units. The two new 
architectures are schematized in Figure 12.

Figure 12. The two new network architectures used for the reaching task (diagram labels in each panel: visual input, proprioceptive input, motor output)
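
The difference between the two architectures lies only in which retina cells project to which units of the lower internal layer. One way to express this is as a connectivity mask, sketched below. The indexing is an assumption made for illustration: cells 0 to 3 are taken to be the left half of the retina and cells 4 to 7 the right half, with lower-layer units 0 and 1 receiving from the left half and units 2 and 3 from the right half.

```python
N_VISUAL, N_LOWER = 8, 4

def full_mask():
    """Architecture of Figure 12a: every cell projects to every unit."""
    return [[1] * N_VISUAL for _ in range(N_LOWER)]

def split_retina_mask():
    """Architecture of Figure 12b: each half of the retina projects to
    its own pair of lower-layer units.

    mask[i][k] is 1 if retina cell k projects to lower-layer unit i.
    """
    mask = []
    for unit in range(N_LOWER):
        sees_left = unit < 2
        mask.append([1 if (cell < 4) == sees_left else 0
                     for cell in range(N_VISUAL)])
    return mask

for row in split_retina_mask():
    print(row)
```

Under these assumptions the fully connected version has 32 retina-to-lower-layer connections and the split version has 16; multiplying a weight matrix elementwise by the mask would enforce the restricted connectivity.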



Both these more complex architectures learn the task as well as the basic architecture
with a single layer of internal units. In both cases the evolutionary curve of performance
across 10,000 generations is practically indistinguishable from that of Figure 6.



3.2 ANALYSIS OF THE INTERNAL ORGANIZATION OF THE NEURAL NETWORKS 

As we have done for the previous simulation, we examined the activation level of the 
internal units of the best individual of the last generation in each of the 10 replications 
of the two new simulations. There are 8 internal units in the new architectures, 4 in the 
lower layer and 4 in the higher layer. The results of this examination are shown in Figures
13 and 14.

Figure 13. Activation level of the 4 units of the higher internal layer (a) and of the 4 units of the lower internal
layer (b) in response to each of the 6 visual input patterns, for the architecture in Figure 12a.

Figure 14. Activation level of the 4 units of the higher internal layer (a) and of the 2+2=4 units of the lower internal
layer (b) in response to each of the 6 visual input patterns, for the architecture in Figure 12b.

The two figures show that the higher layer of internal units, the “pre-motor” units, encodes
macro-actions in the same way as the single layer of internal units in the previous
simulation. As in the previous simulation, there are three types of internal units: those
with (almost) zero activation in response to all 6 input patterns, those with a constant activation
of (almost) 1 for all input patterns, and those with an activation of (almost) 0 encoding
the macro-action of “moving the arm toward the left (or right) portion of the visual
space” and an activation level of (almost) 1 encoding the macro-action of “moving
the arm toward the right (left) portion of the visual space”.

However, when we turn our attention to the internal units of the lower layer, which is 
closer to the visual input, things look different. Although there are some units with either
an almost constant 0 or 1 activation level, most units are selective, i.e., their activation lev- 
el varies but it does not vary with the motor output. In fact, there is no encoding of macro- 
actions at this level of internal units. As a consequence, when we lesion the units of this 
lower layer there is no predictable and systematic behavioral damage. 

What do the internal units of the lower layer encode, then? This question can be an- 
swered more clearly if we look at the lower layer of the second architecture, which is di- 
vided up into two separate sets of 2 units each (Figure 12a). In the other architecture in 
which the entire retina projects to all the 4 internal units of the lower layer (Figure 12b), 
it is difficult to interpret what is encoded in these units because each internal unit re- 
ceives information from both the left and the right visual fields and the information 
which must be extracted from the entire retina is rather complex. On the other hand, 
when we examine the architecture in which the lower internal layer is divided up into 
two separate sets of 2 units which receive information from the left field and from the 
right field, respectively, it becomes clear that the lower layer of internal units encodes 
what is present in the retina. At the input level each portion of the retina may contain ei- 
ther an A object or a B object or nothing. These three different possibilities tend to be 
encoded in the internal units each with a distinct distributed activation pattern. This in- 
formation is then fed to the higher layer of internal units which integrates the informa- 
tion from both fields and, as we know, generates an encoding in terms of macro-actions. 

These simulations seem to imply that the layers of internal units which are closer to the vi- 
sual input tend to produce internal representations which reflect the properties of the visual 
input as such, independently from the actions with which the neural network will respond to 
the visual input, and it is only the internal representations in the layers which are closer to 
the motor output which reflect the properties of the motor output. In the next Section we de- 
scribe a third set of simulations that show that the situation is more complex and that all pro- 
cessing of the visual input inside the neural network reflects and prepares the motor output. 

4. INTERNAL REPRESENTATIONS REFLECT THE CONTEXT THAT DICTATES 
WITH WHICH ACTION TO RESPOND TO THE SENSORY INPUT 

In the simulations described so far the macro-action with which the organism must
respond to the visual input depends exclusively on the visual input. Given some particular
visual input the network always responds with the same macro-action. But consider a
somewhat more complex situation in which the context also has some role in determining
the organism’s response (for experiments, cf. Barsalou, 1983; 1991; for simulations,
cf. Borghi, Di Ferdinando, and Parisi, in press; Di Ferdinando, Borghi, and Parisi, 2002).
The context can be some internal motivational state of the organism or an external
verbal command. Given the same visual input the organism responds with one macro-action
if the context is X and with a different macro-action if the context is Y. What we call
context is an additional input which arrives at the network and interacts with the visual
input to determine the organism’s behavior. What are the consequences of this contextual
information for the neural network’s internal representations?



4.1 SIMULATIONS 

As in the preceding simulations the organism has a total visual field of 2x4=8 cells
divided into a left portion and a right portion of 2x2=4 cells each. However, unlike the
preceding simulations the organism always sees two objects at the same time, either two
objects of the same shape (two A objects or two B objects) or one A object and one B
object, with the A object in the left portion of the visual field and the B object in the right
portion, or vice versa. Hence, the organism’s retina can encode one of the 4 visual scenes
represented in Figure 15.



Figure 15. The content of the organism’s visual field 



The organism has the same two-segment arm which it uses for reaching one of the two 
objects. However, in the new simulations the object to be reached in any given trial de- 
pends on a command the organism receives from outside which can tell the organism to 
reach either the object on the left or the object on the right. 

The neural network has three separate sets of input units. The first set (8 units)
encodes the current content of the visual field and the second set (2 units) encodes the
proprioceptive input which specifies the arm’s current position. This is like the previous
simulations. The third set (2 units) is new. These units are called “command units” and
they encode two possible commands to the organism to reach with its arm either the
object in the left portion of the visual field (encoded as the activation pattern 10) or in the
right portion (encoded as 01). The internal architecture of the organism’s neural
network is composed of two successive layers of internal units. The lower layer of 4
internal units is divided up into two sets of 2 units each separately encoding the content
of the left and right portion of the organism’s visual field. All 4 units of the lower
layer project to all 4 units of the higher layer. Therefore, the visual information from both
the left and right half of the visual field, separately elaborated by the lower layer of
internal units, is put together at this higher layer of internal units. Finally, all the 4 units
of the higher layer project to the 2 output units which encode the arm’s micro-actions.

We have applied two experimental manipulations. As in the preceding simulations, the 
2 proprioceptive input units completely by-pass the internal units and are directly con- 
nected with the output units. On the contrary, for the 2 input units encoding the two com- 
mands “Reach the object on the left” and “Reach the object on the right”, we have adopt- 
ed two different network architectures in two separate simulations. In one simulation the 
command units project to the lower layer of internal units (Low Command, Figure 16a) 
and in the other simulation they project to the higher layer of internal units (High Com- 
mand, Figure 16b). This implies that in the Low Command condition the lower layer of 
internal units elaborates the visual input already knowing the action to be executed in re- 
sponse to the visual input, whereas in the High Command condition the command infor- 
mation becomes available to the neural network only at the higher internal layer and the 
lower internal layer must process the visual input while still ignoring what to do with re- 
spect to it. The two network architectures are shown in Figure 16. 
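
The difference between the two architectures can be made concrete with a minimal forward-pass
sketch. It is an illustration only, under stated assumptions: the logistic activation function,
the random (rather than evolved) weights, the ordering of the 8 retina cells (left portion
first), and the names ReachNet, forward and command_target are hypothetical and are not taken
from the simulations described here. Only the layer sizes and the routing of the proprioceptive
and command units follow the text.

```python
import math
import random


def sigmoid(x):
    # Assumption: logistic units; the text does not specify the activation function.
    return 1.0 / (1.0 + math.exp(-x))


def rand_matrix(rng, n_out, n_in):
    # Random weights stand in for the weights evolved by the genetic algorithm.
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def layer(weights, inputs):
    # One fully connected layer: weighted sum followed by the activation function.
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]


class ReachNet:
    """Hypothetical sketch of the Low Command / High Command architectures."""

    def __init__(self, command_target, seed=0):
        if command_target not in ("low", "high"):
            raise ValueError("command_target must be 'low' or 'high'")
        self.command_target = command_target
        rng = random.Random(seed)
        extra_low = 2 if command_target == "low" else 0
        extra_high = 2 if command_target == "high" else 0
        # Lower layer: 2 units per half of the visual field (4 cells each).
        self.w_left = rand_matrix(rng, 2, 4 + extra_low)
        self.w_right = rand_matrix(rng, 2, 4 + extra_low)
        # Higher layer: 4 units fed by all 4 lower units.
        self.w_high = rand_matrix(rng, 4, 4 + extra_high)
        # Output: 2 units fed by the higher layer plus the 2 proprioceptive units,
        # which by-pass the internal layers.
        self.w_out = rand_matrix(rng, 2, 4 + 2)

    def forward(self, visual, proprio, command):
        visual, proprio, command = list(visual), list(proprio), list(command)
        left, right = visual[:4], visual[4:]  # assumed cell ordering
        low_extra = command if self.command_target == "low" else []
        lower = (layer(self.w_left, left + low_extra)
                 + layer(self.w_right, right + low_extra))
        high_extra = command if self.command_target == "high" else []
        higher = layer(self.w_high, lower + high_extra)
        output = layer(self.w_out, higher + proprio)
        return lower, higher, output


# Hypothetical scene and arm position; commands follow the text (10 = left, 01 = right).
scene = [1, 0, 0, 1, 0, 1, 1, 0]
arm = [0.3, 0.7]
lower_left_cmd, _, _ = ReachNet("high").forward(scene, arm, [1, 0])
lower_right_cmd, _, _ = ReachNet("high").forward(scene, arm, [0, 1])
```

In the High Command variant the lower-layer activations cannot depend on the command, since the
command units are not connected to that layer; in the Low Command variant they can.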



Figure 16. The two network architectures used in separate simulations

The second experimental manipulation concerns the manner in which the organism has
to grasp the two objects, object A and object B, after reaching them. We assume that the
objects have a “handle” which is used for grasping them. In one condition (Same Handle)
both objects have the same handle, which is located where the two filled cells that
represent an object touch each other. Hence, both A and B objects are grasped in the
same manner, that is, by putting the arm’s endpoint (the “hand”) on the point of contact
between the two filled cells. In the other condition (Different Handle) the objects have
handles located differently: A objects are grasped by putting the “hand” on the higher
cell and B objects are grasped by putting the “hand” on the lower cell (Figure 17).

Figure 17. In the Same Handle condition both objects are grasped by putting the “hand” on the point of
contact between the filled cells whereas in the Different Handle condition objects of type A are grasped by putting
the hand on the higher cell and objects of type B are grasped by putting the hand on the lower cell.









One consequence of this experimental manipulation is that in the Same Handle condi- 
tion the organism’s neural network can ignore the difference in shape between the two 
types of objects since the different shape does not affect the nature of the motor action 
the organism must execute in grasping the objects. On the contrary, in the Different Han- 
dle condition the shape of the object must be recorded and elaborated by the organism’s 
neural network since objects of shape A require a different type of motor action (grasp- 
ing) from objects of shape B. 

In total we have four experimental conditions: (1) Low Command/Same Handle; (2)
Low Command/Different Handle; (3) High Command/Same Handle; (4) High
Command/Different Handle. For each of the four experimental conditions we have run 10
replications of the simulation using the genetic algorithm with different initial
conditions. In all 4 experimental conditions our organisms acquire the ability to respond
appropriately. At the end of the simulation the best organisms are able to respond by
reaching for the appropriate object and grasping it in the required manner. We now
examine the internal organization of our organisms’ neural networks.

4.2 ANALYSIS OF THE INTERNAL ORGANIZATION OF THE NEURAL NETWORKS

For each of the four experimental conditions, we examine the activation level of the 2
layers of internal units of the best individual of the last generation in each of the 10
replications. For the higher layer, which is closer to the motor output, the results are
identical to what we have found in the previous simulations: this layer encodes the
macro-actions to be produced by the organism. In the two Same Handle Conditions there are
only two macro-actions: “reach and grasp the object on the left” and “reach and grasp the
object on the right”. On the contrary, in the two Different Handle Conditions there are
four different macro-actions: “reach and grasp the A object on the left”, “reach and grasp
the B object on the left”, “reach and grasp the A object on the right” and “reach and grasp
the B object on the right”. Correspondingly, we find two different activation patterns in
the higher internal layer in the two Same Handle Conditions and four different
activation patterns in the two Different Handle Conditions.
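
The count of two versus four macro-actions can be checked by enumerating every combination of
visual scene and command. The sketch below is only a bookkeeping aid; the names SCENES, SIDES,
macro_action and distinct_macro_actions are hypothetical, and scenes are written as
(left object, right object) pairs for convenience.

```python
from itertools import product

# The four visual scenes of Figure 15, written as (left object, right object).
SCENES = [("A", "A"), ("B", "B"), ("A", "B"), ("B", "A")]
# The two commands: reach the object on the left or on the right.
SIDES = ["left", "right"]


def macro_action(scene, side, different_handle):
    """Label of the macro-action required for one scene/command pair."""
    shape = scene[0] if side == "left" else scene[1]
    if different_handle:
        # The grasp depends on the shape, so the shape is part of the action.
        return f"reach and grasp the {shape} object on the {side}"
    # Same handle: the shape of the target is irrelevant to the action.
    return f"reach and grasp the object on the {side}"


def distinct_macro_actions(different_handle):
    """Set of distinct macro-actions over all 4 scenes x 2 commands."""
    return {macro_action(scene, side, different_handle)
            for scene, side in product(SCENES, SIDES)}


same = distinct_macro_actions(different_handle=False)
different = distinct_macro_actions(different_handle=True)
print(len(same), len(different))  # prints "2 4"
```

The eight scene/command combinations collapse onto two labels in the Same Handle case and onto
four in the Different Handle case, matching the number of activation patterns reported for the
higher internal layer.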

Let’s now turn to the lower layer of internal units, which is closer to the visual input
and which therefore should reflect the properties of the visual input rather than those of
the motor output. Figure 18 shows the results of our analysis for 4 individuals, one for
each experimental condition.

Figure 18. Activation patterns observed in the lower layer of internal units in response to the four visual stimuli
when the command is “Reach and grasp the object on the left” (first four rows) and when the command is “Reach
and grasp the object on the right” (last four rows), in the Low Command/Same Handle Condition (a), in the Low
Command/Different Handle Condition (b), in the High Command/Same Handle Condition (c), and in the High
Command/Different Handle Condition (d). The results are from 4 individuals, one for each condition.

What Figure 18 tells us is that the properties of the visual input are represented in
different ways in the four conditions. In particular, two main results emerge:

(1) The internal representations of the two objects A and B are different only in the
Different Handle Conditions. In the Same Handle Conditions, both when the command
arrives at the lower layer (Low Command) and when the command arrives at the higher
layer (High Command), there is no difference between objects A and B.

(2) However, even in the two Different Handle Conditions, when the command arrives
at the lower layer (Low Command) there is a difference between objects A and B only
in the two internal units connected to the portion of the visual field which contains the
object to be reached and grasped, while when the command arrives at the higher layer
(High Command) the difference appears in both pairs of units.

Let us try to explain these results. In the two Same Handle Conditions the two objects 
have to be grasped in the same way and therefore there is no need for the neural network 
to know whether an object is A or B (Figure 18a and 18c). On the contrary, in the two 
Different Handle conditions, the two objects have to be grasped in different ways and 
therefore the neural network produces different internal representations for objects A and 
for objects B (Figure 18b and 18d). 

The correctness of this analysis can be demonstrated by comparing, for the best indi- 
vidual of each of the 10 replications of the simulation, the normalized distance between 
the internal representations of the two objects A and B in the lower layer for the High 
Command/Same Handle Condition and for the High Command/Different Handle Con- 
dition (Figure 19). Similar results are found if we compare the Low Command/Same 
Handle Condition and Low Command/Different Handle Condition. 
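
One way to compute such a distance is sketched below. The text does not state how the distance
is normalized, so the choice made here (Euclidean distance divided by its largest possible value
for activations between 0 and 1) is an assumption, as are the function name normalized_distance
and the example activation patterns.

```python
import math


def normalized_distance(pattern_a, pattern_b):
    """Euclidean distance between two activation patterns, scaled to [0, 1]."""
    if len(pattern_a) != len(pattern_b):
        raise ValueError("patterns must have the same number of units")
    squared = sum((a - b) ** 2 for a, b in zip(pattern_a, pattern_b))
    # Assumption: activations lie in [0, 1], so the largest distance is sqrt(n).
    return math.sqrt(squared) / math.sqrt(len(pattern_a))


# Hypothetical lower-layer patterns (4 units) evoked by object A and by object B.
same_handle = normalized_distance([0.9, 0.1, 0.9, 0.1], [0.9, 0.1, 0.9, 0.1])
different_handle = normalized_distance([0.9, 0.1, 0.2, 0.8], [0.1, 0.9, 0.2, 0.8])
```

With this definition a value of 0 means that A and B evoke the same pattern, and a value of 1
means that every unit switches between 0 and 1. The same function can be applied to a pair of
units only, for example to the 2 units connected to one half of the visual field.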



Figure 19. Normalized distance between the internal representations of the two objects A and B in the lower
layer for the High Command/Same Handle Condition and for the High Command/Different Handle Condition

These results show that the network represents the two objects A and B with much
more different activation patterns in the Different Handle Condition than in the Same
Handle Condition. Thus, also in internal layers which are close to the visual input the
internal representations tend to reflect the properties of the motor output rather than those
of the visual input.

However, in the Different Handle Conditions the two objects A and B are represented in 
different ways depending on the internal layer to which the command arrives. In the Low 
Command/Different Handle Condition the lower layer of internal units already knows what 
macro-action must be executed and therefore we expect that the network’s internal repre- 
sentations will reflect already at this level of processing of the visual input the requirements 
of the macro-action to be executed. If the command is “Reach and grasp the object on the 
left” A and B objects will be represented in different ways in the internal units connected to 
the left half of the visual field but they will be represented in the same way in the internal 
units connected to the right half, and vice versa if the command is “Reach and grasp the ob- 
ject on the right”. This is exactly what we observe in Figure 18b. On the contrary, in the 
High Command/Different Handle Condition the information on whether the object on the 
left or the object on the right is to be reached and grasped arrives later on to the neural net- 
work, i.e., at the level of the second (higher) layer of internal units. Therefore, in the lower 
layer of internal units, since at this stage of processing of the visual input the network still 
ignores which action is to be executed in response to the visual input, the internal represen- 
tations will reflect only the properties of the visual input and ignore the requirements of the 
actions. This is what we observe in Figure 18d. 

The robustness of this analysis can be demonstrated by comparing, in the Low
Command/Different Handle Condition, for the best individual of each of the 10 replications
of the simulation, the normalized distance between the internal representations of the
two objects in the lower layer, separately for the two units that are connected to the
portion of the visual field where the object to be reached is found and for the two units
connected to the other portion (Figure 20).

Figure 20. Normalized distance between internal representations of the two objects in the lower layer, in the
two internal units that are connected to the portion of the retina which contains the object to be reached
(attended), and in the two internal units connected to the other portion (unattended), in the Low
Command/Different Handle Condition.

This analysis shows that the network represents the two objects A and B with much
more different activation patterns in the two internal units that are connected to the
portion of the visual field which contains the object to be reached (attended hemifield) than
in the two units connected to the other portion (unattended hemifield). Thus, not only does
the manner in which neural networks represent the visual input depend on the requirements
of the actions with which the organism must respond to the visual input rather than on
the visual input’s intrinsic properties, but these representations can also vary in the same
neural network according to the particular task in which the organism is currently involved.



5. CONCLUSIONS 

This chapter has been concerned with the question of what internal representations un- 
derlie the ability of organisms to respond to sensory input with the appropriate motor 
output. The behavior of organisms is controlled by their nervous system and we have 
modelled the nervous system using artificial neural networks. Neural networks can be 
trained to exhibit desired behaviors (sensory input/motor output mappings) and then one 
can examine their internal representations, i.e., the activation patterns caused by the sen- 
sory input in the network’s internal units. 

We have described a series of computer simulations that allow us to draw the follow- 
ing conclusions. 

Both macro-actions (meaningful sequences of micro-movements that achieve some 
goal for the organism) and micro-actions (the micro-movements that make up a macro- 
action) are real for neural networks. Micro-actions are the activation patterns observed 
in each input/output cycle in the network’s motor output units. Macro-actions are ob- 
served internal representations which remain the same during an entire succession of in- 
put/output cycles and which, together with the constantly changing proprioceptive feed- 
back, control the succession of micro-actions until a goal has been reached. 

Internal representations tend to reflect the properties of the motor output (macro-actions)
with which the organism responds to some particular visual input rather than those
of the visual input itself. It is the visual input that is the immediate cause of the internal
representations but it is the motor output that, through the adaptive history of the
organisms, dictates the form of these representations. The properties of the visual input are
retained in the internal representations only in so far as they are relevant for the action to
be executed in response to the visual input.

Of course, if the neural network includes a succession of internal layers, the early layers
which are closer to the visual input will tend to reflect more the properties of the visual
input and the later layers which are closer to the motor output the properties of the motor
output. However, even in layers which are very close to the visual input, the internal
representations will preserve only those properties of the visual input that are relevant
for the action with which the organism must respond to the input. If the same visual
input may elicit different actions depending on the context, the internal representations of
the visual input will vary as a function of the current context and therefore of the
contextually appropriate action which has to be generated. The critical result is the result for
the two High Command conditions in the simulations of Section 4. Even if at the lower level
the network still does not know what action must be generated in response to the visual
input, the internal representation of the visual input at this lower level does not simply reflect
the properties of the visual input. If the adaptive pattern of the organism, which is the
current result of the adaptive history of the organism (genetic algorithm), requires the
organism to respond in the same way to A and B objects (Same Handle condition), A and B
objects will be represented in the same way at this lower level (Figure 18c). But if the
adaptive pattern requires the organism to respond in two different ways (Different Handle
condition), A and B objects will be represented in two different ways (Figure 18d). We
conclude that there is no neutral representation of the sensory input in sensory-motor
networks, that is, no representation which simply reflects the properties of the sensory input;
all representations of sensory input, at all levels, are informed by the requirements of the
action with which the organism must respond to the sensory input. The visual input is not
internally represented in a fixed way which only reflects its intrinsic properties; rather, its
representation appears to be flexible and adaptable to the current needs of the organisms and
to the specific action with which the organism must respond to the input given the particular
context.



6. CODA 

Gaetano Kanizsa wanted to keep perception separate from cognition because, at the
time he was writing, there was a real danger that perception could be assimilated to
cognition as symbol manipulation, whereas Kanizsa thought that perception is a (physical?)
dynamical system. Our approach assimilates perception not to cognition but to action
and it interprets everything which takes place in the mind (i.e., in the brain) as a
physical dynamical system.



MOVEMES FOR MODELING BIOLOGICAL MOTION PERCEPTION



1. INTRODUCTION 

In 1973 Gunnar Johansson [18] discovered the surprising ability of our visual system
to perceive biological motion, i.e. the motion of the human body, even from highly
impoverished and ambiguous stimuli. In attempting to develop models for explaining
the perception of biological motion we examine models developed for different purposes
in two areas of engineering, computer vision and computer graphics.

Perceiving human motion, actions and activities is as important to machines as it is to 
humans. People are the most important component of a machine’s environment. En- 
dowing machines with biologically-inspired senses, such as vision, audition, touch and 
olfaction appears to be the best way to build user-friendly and effective interfaces. Vi- 
sion systems which can observe human motion and, more importantly, understand hu- 
man actions and activities, with minimal user cooperation are an area of particular im- 
portance. 

Humans relate naturally with other humans; therefore animated characters with a
human appearance and human behavior (as well as, of course, other creatures inspired by
animals) are an excellent vehicle for communication between machines and humans.
Accurate rendering techniques developed in computer graphics and ever increasing
computer performance make it possible to render scenes and bodies with great speed and
realism. The next frontier in computer graphics is automatically and interactively
animating characters to populate screentops, web pages and videogames.

In sum: the next generation of both animation and vision systems will need to bridge
between the low-level data (rendered bodies, observed images) and the high-level
descriptions that are convenient as either input (to graphics) or output (from computer
vision) for automatic interaction with humans. Models developed within this engineering
effort shed light on biological motion perception since it shares the same computational
underpinning. We argue that models of how humans move are the key ingredient to
this quantum step. We start by discussing the inadequacies of the current models
(sections 2 and 3) and propose an alternative style of modeling human motion based on
movemes, a felicitous term which we adopt from Bregler and Malik [10] and further
extend from meaning “elementary stereotypical motions” to “elementary motions
parameterized by goal and style” (section 4). We explore the practical aspects of movemes
in section 5 and discuss six case studies in sections 6, 8, 9 and 10. The issues of styles and
composition are discussed in sections 8 and 11. The details of the models are discussed
in the appendix.



2. MODELS IN MACHINE VISION

While it is easy to agree that machines should “look” at people in order to better
interact with them, it is not immediately obvious which measurements a machine should
perform on a given image sequence, and what information should be extracted from the
human body. There are two classes of applications: “metric” applications where the
position of the body has to be reconstructed in detail in space-time (e.g. used as input for
positioning an object in a virtual space), and “semantic” applications where the meaning of
an action (e.g. “she is slashing through Rembrandt’s painting”) is required. The task of
the vision scientist/engineer is to define and measure “visual primitives” that are
potentially useful for a large number of applications. These primitives would be the basis for
the design of perceptual user interfaces [28, 14] substituting mouse motions and clicks,
keystrokes etc. in existing applications, and perhaps enabling entirely new applications.

Which measurements should we take? It is intuitive that if one could reconstruct
frame-by-frame the 3D position of each part of the body one would have an excellent set of
visual primitives. One could use such measurements directly for metric applications, and
feed them to applications. This avenue, which we shall call puppet-tracking, has been
pursued by a number of researchers. The most successful attempts start from the
premise that the overall kinematic structure of the human body is known, and its pose may be
described synthetically by the degrees of freedom of the main joints (e.g. three in the
shoulder, one in the elbow, two in the wrist etc.) so that the whole body is represented in
3D by a stick-figure not dissimilar from the wooden articulated puppets that art students
use for training in figure-drawing. Precise knowledge of the body’s kinematics,
together with some approximation of the dynamics of the body, allows one to invert the
perspective map if each limb of the body is clearly visible throughout the sequence. One
may thus track the body both on the image plane and in 3D with some success and
accuracy even using a single camera. For example, Rehg and Kanade [22] were able to
track the movements of the fingers of a hand, even when the hand underwent complete
rotation around the wrist. Goncalves et al. [16, 3] demonstrated their monocular 3D
arm-tracking system in real-time and used it as an interface to a rudimentary 3D virtual
desktop. Bregler and Malik [11] showed that they could accurately track walking human
figures in the Muybridge sequences. In the first two studies the dynamics of the body was
approximated by a random walk. In Bregler and Malik the model is more sophisticated:
four second-order random walks with different statistics governed by a four-state
Markov process.
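
To make the notion of random-walk dynamics concrete, the sketch below simulates a single joint
angle under two simple motion models. It is not the model of any of the cited trackers: the
function names, the Gaussian noise and the parameter values are assumptions, and “second-order”
is read here in one common sense, namely that the noise drives the second difference
(acceleration) of the angle rather than its first difference. The Markov switching between
several such models is not shown.

```python
import random


def first_order_walk(theta0, sigma, steps, rng):
    """Joint angle whose increment at each frame is pure Gaussian noise."""
    angles = [theta0]
    for _ in range(steps):
        angles.append(angles[-1] + rng.gauss(0.0, sigma))
    return angles


def second_order_walk(theta0, sigma, steps, rng):
    """Joint angle whose velocity persists; noise perturbs the acceleration."""
    # Start at rest: the two most recent angles are equal.
    angles = [theta0, theta0]
    for _ in range(steps):
        predicted = 2.0 * angles[-1] - angles[-2]  # constant-velocity prediction
        angles.append(predicted + rng.gauss(0.0, sigma))
    return angles[1:]


rng = random.Random(0)
elbow_first = first_order_walk(theta0=0.5, sigma=0.02, steps=100, rng=rng)
elbow_second = second_order_walk(theta0=0.5, sigma=0.02, steps=100, rng=rng)
```

Both models describe motion as noise around a trivial prediction (no change, or no change in
velocity), which is why a tracker built on them needs good image measurements at every frame.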

These initial results are encouraging: they demonstrate that successful 3D tracking of
the main degrees of freedom of the body may be achieved from monocular observations
of a subject who is not wearing special clothing or markers. However, they also point
to some fundamental difficulties of the puppet-tracking approach. First of all, these
trackers need to be initialized near the current state of the system; for example, in
Goncalves’ case the user has to start from a standard pose and hit a button on the
keyboard to initiate tracking. Moreover, the tracker needs frequent re-initializations (every
1000 frames or so in Goncalves et al., not reported by the other studies). This is not
satisfactory. In addition, all these systems depend crucially on precise knowledge of the
mutual position of key features and/or of the kinematics of the body, and therefore require
an initial user-specific calibration; the user also has to wear tight-fitting clothes,
or, more entertainingly, no clothes at all in order to allow sufficient accuracy in
measuring the position of body features frame-by-frame. Furthermore, occlusion has
devastating effects. This is a far cry from the performance of the human visual system which
can make sense of 3D human motion from monocular signals (e.g. TV) even when the
actors wear loose clothing, such as overcoats, tunics and skirts, and when the signal is
rather poor (e.g. a person seen at 100m spans fewer than 100 “pixels” in our retina).

What is wrong with the puppet-tracking approach, and what may be the secret of the 
human visual system? 

First of all: the issue of automatic detection of humans and automatic initialization of
human trackers has been so far ignored and needs to be addressed. We will not discuss
this topic any further here; initial encouraging results are reported in [27, 26].

Second: current models are too “brittle” to be practical. Unrealistically accurate
and detailed kinematic modeling is required, clothing is a nuisance, occlusion is
a problem. One could say that the puppet-tracking approach tries to estimate too
much and knows too little. On one hand, one must realize that for the “semantic”
applications it should not be necessary to measure the position of all limbs
frame-by-frame in order to interpret human activities which span many hundreds of frames. On
the other hand, random walks cannot be the ultimate model of human motion.
Humans do not move at random.

In the following we will argue that one has to reverse the current approach and take a
counterintuitive point of view. Rather than reconstructing the 3D pose directly from the
image stream and then feeding such a signal into a recognition box for semantic analysis, one
should first recognize the motions of the body at some discrete high level and then (if
necessary) reconstruct the frame-by-frame pose from these. We will argue that this
approach has several advantages: robustness, parsimony, convenience for high-level
processing.



3. MODELS IN ANIMATION 

It is increasingly common to see computer-generated human characters in motion
pictures that are so life-like and convincing that one may be led into believing that the
problem of automatic animation has been satisfactorily solved. This is far from the
truth. One minute’s worth of animation can take weeks to produce and is the result of
the manual labor of talented artists: current characters are puppets rather than actors.
Those animated characters whose motion is synthesized automatically (such as for a
character on a webpage, or in a computer game) do not move realistically and have
highly restrictive repertoires. Automatically animating realistic human characters is
still an unsolved problem.

The ultimate technique for automatic animation would require high-level directions
much like a director controls an actor (“walk towards the door expectantly”, “pick up the
glass in a hurry”) rather than a puppeteer’s step-by-step and joint-by-joint control. It
would produce motion patterns that would be indistinguishable from the ones obtained
from real people.

The automatic generation of realistic human motion has been a topic of research within
the computer graphics and robotics communities for many years. Three techniques are
used: dynamic simulation, optimization of constraints, and editing of motion capture
data.

Dynamic simulation models start from the physics: a 3-D model of a body built out of
rigid segments with associated masses and moments of inertia is moved by torques applied
at the various joints, simulating muscle actions. Hodgins et al. [17] use this approach to
animate human athletes. Hand-crafted controllers place the feet, rotate the joints, and
impose the phase of arm and leg swing. Humans running, springboard vaulting and diving
have been generated this way.

This method has two shortcomings. First, the animations it generates look “robotic”. Second,
success appears to be restricted to athletic motions, since these are dynamically constrained
and highly optimized: there are not many ways to run or jump competitively. Ordinary
non-athletic motions such as strolling or picking up a light object, by contrast, can be
performed in many ways: each person has a particular, identifiable walking style and a
good actor can imitate various styles. In this case it is the brain that defines the motion and
the dynamics of motion is virtually irrelevant. Thus for everyday actions dynamic simulation
methods will fail to generate realistic motion unless the brain’s motor control signals
can be properly modeled and imitated.

Another approach to motion synthesis is to use the robotics-based techniques of inverse
kinematics [21] (see also a similar technique called “space-time constraints” developed
by Witkin et al. [29]). When applied to a robot, these techniques allow the computation
of the robot’s configuration (set of joint angles) that will place the end-effector at a
certain position and orientation in space (subject possibly to additional constraints imposed
by the environment or other considerations). When combined with energy- or torque-minimizing
principles, these methods can be used to produce efficient and well-behaved robot
motions. Badler et al. [1] have used such solutions in the development of their Virtual
Jack project. Virtual Jack has a complex virtual skeleton with over 100 joints, and
was designed to simulate the posing (and to some extent the motion) of the human body
in order to provide accurate ergonometric information for ergonomic studies. However,
Jack is known to have a stiff back, stand in awkward poses, and move in a robotic fashion.
The reason for this is that these robotic approaches have no notion of the naturalness
of a pose, or of the intricate, characteristic phases between the motions of different
body parts.

A third class of methods generates realistic motion by manipulating motion capture
data (3-D recordings of people’s movements). Bruderlin [12] uses the technique of
Motion Signal Processing, whereby multi-resolution filtering of the motion signals
(changing the gain in different frequency bands) can produce different styles of motion.
For example, increasing the high-frequency components produces a more nervous-looking
motion. Gleicher [15] introduced the method of motion editing with space-time constraints.
Here, a motion is adapted to satisfy some new constraints, such as jumping higher
or further. The new motion is solved for by minimizing the required joint displacements
over the entire motion, thereby attempting to preserve the characteristics of
the original motion. It can be thought of as a generalization of the inverse-kinematics
techniques where, instead of computing the pose to satisfy the constraint only during a
single frame, the modification of pose is done through time. However, since there is no
notion of how realistic (or human-like) a modification is, the method can be used only
to generate small changes; otherwise, the laws of physics appear to be defied.



4. WHY MOVEMES AND WHICH MOVEMES? 

In 1973 Gunnar Johansson [18] discovered the surprising ability of our visual system
to perceive biological motion, i.e. the motion of the human body. He filmed people
wearing light bulbs strapped to their joints while performing everyday tasks in a darkened
room. Any frame of Johansson’s movies is a black field containing an apparently
meaningless cloud of bright dots. However, as soon as the movie is animated one has the
vivid impression of people walking, turning the pages of a book, climbing steps, etc. We
formulate the hypothesis that this apparently impossible perceptual task can be solved only
because people move in stereotypical ways and our visual system has an exquisite
model of how people typically move. If we could understand the form and the parameters
of such a model we would have a powerful tool both for animation and for vision.

In looking for a model of human motion one must understand the constraints on such
motion. First of all, our motions are constrained both by the kinematics and by the
dynamics of our body. Our elbows are revolute joints with one degree of freedom (DOF),
our shoulders are ball joints with three DOF, etc. Moreover, our muscles have limited
force, and our limbs have limited acceleration. Knowledge of the mechanical properties
of our bodies is helpful in constraining the space of solutions of biological motion
perception. However, we postulate that there is a much more important constraint: the
motion of our body is governed by our brain. Apart from rare moments, when we are either
competing in sport or escaping an impending danger, our movements are determined by
the stereotypical trajectories generated by our brain [4]; the dynamics of our body at
most acts as a low-pass filter. For example, our handwriting is almost identical whether
we write on a piece of paper or on a board, despite the fact that in one case we use
fingers and wrist and in the other we use elbow and shoulder, with completely different
kinematics and dynamics at play (this principle of motor equivalence was discovered by
Bernstein and collaborators in the first half of the 1900s [2]).

Why would our body move in stereotypical ways? Our brain moves our body in order
to achieve goals, such as picking up objects, turning to look at the source of a noise, or
writing. The trajectories that are generated could, in principle, be different every time and
rather complex. However, generating trajectories is a complex computational task
requiring the inversion of the kinematics of the body in order to generate muscle control
signals. Rather than synthesizing them from scratch every time, the brain might take a
shortcut and concatenate a number of memorized, pre-made component trajectories into
a complete motion. Neurophysiological evidence suggests that the nervous system may
indeed encode complex motions as discrete sequences of elementary trajectories [6, 8].
Moreover, these trajectories appear to be parameterized in terms of Cartesian ‘goal’
parameters, which is not surprising given that most motor tasks are specified in
terms of objects whose position is available to the senses in Cartesian space.

This suggests a new computational approach to biological motion perception and to
animation. One could define a set of elementary motions, or movemes, which would roughly
correspond to the ‘elementary units of motion’ used by the brain. One could represent
complex motions by concatenating and combining appropriate movemes. These movemes
would be parameterized by ‘goal’ parameters in Cartesian space. This finds analogies in
other human behaviors: ‘phonemes’ are the elementary units both in speech perception
and production; in handwriting one thinks of ‘strokes’ as the elementary units.

How realistic is this approach? Which are the natural movemes and how many are
there? How should one parameterize a moveme? We address these questions in the next
sections.

In comparing the ‘frame-by-frame puppet-tracking’ approach to a moveme-based ap- 
proach one notes that the second has the potential to reduce dramatically the number of 
parameters to be estimated, thus conferring great robustness on a vision system that 
knows about movemes. Moreover, a moveme-based approach transforms the description 
of human motion from continuous time trajectories to sequences of discrete tokens. The 
latter description is a better starting point for high-level interpretation of human motion. 



5. MOVEMES IN PRACTICE 

For a moveme-based approach to exist one must specify a set of movemes. How might
one define a moveme, and how might one discover this set? As we mentioned above, our
working hypothesis is that movemes are building blocks used by the brain in constructing
the trajectories that our body should follow in order to achieve its goals. Therefore it is
not easy to measure movemes directly; we may have to settle for observing movemes
phenomenologically and indirectly. The following five criteria summarize our intuition
on what makes a good moveme, and guide our search. A moveme should be:

Simple - Few parameters should be necessary to describe a moveme. There should be 
no natural way to further decompose a moveme into sub-movemes. 

Compositional - A moveme should be a good ‘building block’ to generate complex 
motions. One should avoid movemes that are not necessary. 



Sufficient - The set of all movemes should suffice to represent all common human actions
and activities with accuracy.

Semantic - Motion is most often goal-oriented; movemes should correspond to simple
goals, which provide a natural parameterization for the moveme. Roughly speaking, a
good moveme should correspond to a verb in language [23] and it should be parameterized
by a meaningful goal in ambient space. The descriptors for a moveme are thus at least two
sets of parameters: the identity of the moveme, and the goal of the moveme (we will see
later that at times additional ‘style’ parameters are needed).

Segmentable - It should be easy to parse the continuous pattern of motion that a
body describes over several minutes, or hours, into individual movemes. This is the
case if the beginning and ending of a moveme are stationary or otherwise stereotypical.
This makes estimating moveme models easier; more importantly, it makes it easier
to compose complex motions from movemes (simple boundary conditions between
individual movemes) and cheaper to recognize movemes automatically in a computer
vision system (easy criteria for bottom-up segmentation).

This last property is not strictly necessary, but it is convenient and thus desirable.

A ‘complete set’ of movemes may be defined as the set of all the movemes that a human
will use. Alternatively, it may be defined as the minimal set of movemes from which, by
combination and concatenation, all the motions of a human may be described with sufficient
accuracy.

The task of enumerating a complete set of movemes goes beyond the scope of this pa- 
per. If one takes the analogy of phonemes this effort is likely to take many years of ac- 
cumulated experience and insight. The goal of this paper is more modest: beyond pro- 
posing an argument for the study of movemes, which we have just done, we wish to ex- 
amine a small set of ‘case studies’ that will allow us to exemplify our intuition and to ex- 
plore such practical issues as the complexity of a moveme, the accuracy with which we 
may expect a moveme to model human motion, how to compose movemes into complex 
motions. 



6. METHODS 

Guided by the principles listed above, we have chosen six movemes for our analysis: 
step, step over, look, reach, draw, run. These movemes are a sufficient set to test a num- 
ber of modeling strategies, and represent a range of moveme complexity. 

To model each moveme, we take a phenomenological or ‘perceptual’ approach. 
Rather than attempting to build models based on physical or optimality principles, we 
build models from observations of people performing movements. Thus our models 
will be empirically derived functions mapping moveme parameters to the motion of the 
entire body. For example, in the case of the moveme ‘reach’ we parameterize the
moveme with the three coordinates of the point to be reached. Thus we assume that the
entire motion of the body for the 1-2 seconds that the motion lasts (i.e. approximately
$10^4$ parameters) is determined by just three numbers, the Cartesian coordinates of the
target.

In order to satisfy the compositional criterion, a moveme must take into
account the initial pose the body is in at the beginning of the movement. This can be
done by using state parameters along with the goal parameters. For instance, in placing
a step, it is necessary to know where the foot starts off, and this can be encoded in a state
parameter for the initial foot position.

Collecting the data to build a moveme model involves the following steps: 

• Capture: A 3-D optical motion capture system is used to record the motion of a person.
The system records motion at 60 Hz with an accuracy of 1 mm. Eighteen markers are
placed on the body at the main joints. The subject acts out several samples of a moveme
by, for example, walking back and forth to provide samples of taking a step, or repeatedly
reaching to various locations to provide samples of reaching.

• Segmentation: The motion capture data is segmented into individual moveme sample
trajectories by detecting the start and stop of each sample trajectory. To facilitate
analysis, all the trajectories are re-sampled in time to have the same duration.

• Representation: The motion capture data is converted to ‘bone-marker’ data; rather 
than using the coordinates of the 18 markers directly, a 3-D skeletal model of the actor 
is used to compute the corresponding motion of the joints. Then virtual mocap data is 
generated, placing 21 bone-markers at the major joint centers. This procedure ensures a 
standard representation of body motion, and eliminates inaccuracies due to inexact 
placement of markers on the body. 

• Labeling: Each moveme sample trajectory is labeled with the appropriate values of 
the goal and state parameters. For instance, reach moveme samples are labeled with the 
3-D coordinate of the final hand position during the reach. 
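
The re-sampling mentioned in the segmentation step can be sketched as follows. This is a minimal illustration that assumes plain linear interpolation between neighbouring frames; the interpolation scheme actually used is not stated here, and the names `Frame` and `resample_trajectory` are hypothetical.

```python
from typing import List, Sequence

Frame = List[float]  # one pose: flattened marker coordinates at one instant


def resample_trajectory(frames: Sequence[Frame], n_out: int) -> List[Frame]:
    """Linearly resample a trajectory to exactly n_out frames (assumed scheme)."""
    if n_out < 2 or len(frames) < 2:
        raise ValueError("need at least two input and two output frames")
    n_in = len(frames)
    out: List[Frame] = []
    for k in range(n_out):
        # Position of output frame k on the input time axis, in [0, n_in - 1].
        pos = k * (n_in - 1) / (n_out - 1)
        lo = min(int(pos), n_in - 2)
        frac = pos - lo
        a, b = frames[lo], frames[lo + 1]
        out.append([(1.0 - frac) * u + frac * v for u, v in zip(a, b)])
    return out


short = [[0.0], [1.0], [2.0]]  # a 3-frame, 1-coordinate toy trajectory
resampled = resample_trajectory(short, 5)
```

With this helper, samples of different lengths can all be brought to a common length before they are stacked into training pairs.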

The set of labeled moveme sample trajectories forms a set of input-output pairs (the goal
and state parameters, and the resulting body trajectories) that can be used to build models,
as discussed in the next section. One decision that remains, however, is how to represent the
motion of the entire body. There are two natural representations to choose from. The body
motion can be represented in terms of the angular coordinates of all the body joints, or it can
be represented in terms of the motion of the 21 bone-markers. The former representation has
the possible advantage of implicitly incorporating some knowledge of the kinematic structure
of the human body. However, it is a highly non-linear space, as compared to the 3-D Euclidean
space of marker coordinates, which may make it more difficult to learn an accurate
model. Furthermore, errors in angular coordinates affect the body pose in a cumulative fashion
as one progresses through the kinematic chain of the spine and limbs, whereas errors in
marker coordinates do not affect each other. Finally, Bizzi [7] has put forward the case that
the brain represents and controls trajectories in Cartesian coordinates. For these reasons, we
represent body motion with the trajectories of the 21 bone-markers in Cartesian space.
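
The cumulative effect of angular errors can be illustrated with a toy planar chain. The sketch below is not the body model used here: the three-segment chain, its lengths, and the 0.05 rad error are illustrative assumptions, and `chain_positions` is a hypothetical helper.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def chain_positions(angles: List[float], lengths: List[float]) -> List[Point]:
    """Forward kinematics of a planar chain; angles are relative joint angles (rad)."""
    x = y = heading = 0.0
    points: List[Point] = []
    for angle, length in zip(angles, lengths):
        heading += angle  # each joint rotates everything distal to it
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        points.append((x, y))
    return points


lengths = [0.3, 0.3, 0.3]          # three 30 cm segments (illustrative values)
angles = [0.2, -0.1, 0.3]
noisy = [0.2 + 0.05, -0.1, 0.3]    # 0.05 rad error at the first joint only

clean_pts = chain_positions(angles, lengths)
noisy_pts = chain_positions(noisy, lengths)
# Displacement of each segment end point caused by the single angular error.
drift = [math.dist(p, q) for p, q in zip(clean_pts, noisy_pts)]
```

In this example the displacement grows with distance along the chain, whereas an error added to one marker coordinate would move that marker alone.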



7. MODELS 



Let $m$ be a moveme, and let $y_m(x)$ represent the sample trajectory of moveme $m$ with label
(goal and state parameters) $x$. Then modeling of movemes can be viewed as a function
approximation problem: given a set of input-output pairs $\{x, y\}$, find a function $f$ which
approximates the output $y$ given the input $x$. Once $f$ has been estimated from our training
data, animation can be generated from a very high level: by choosing the moveme and
the goal parameters, the motion of the entire body is generated. Likewise, for recognition
and perception, one needs to find which of many possible moveme functions (and associated
values of the goal and state parameters) best fits an observed motion.

We experimented with several different model types: linear models, higher-order
polynomial models, radial basis function networks, and feed-forward networks with
sigmoidal hidden units (see appendix A for details of each model type). Surprisingly, the
simplest of models (the linear model) usually performs quite well, as will be shown in
section 10, where we discuss the issue of performance evaluation.
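
As a rough illustration of the linear model, the sketch below fits an affine map from labels to trajectories by ordinary least squares, solving the normal equations directly. It is one plausible reading of “linear model”, not necessarily the formulation of appendix A; the toy data and the names `fit_linear`, `predict` and `solve` are assumptions.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def solve(a: Matrix, b: Matrix) -> Matrix:
    """Solve a @ w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: labels do not span the input space")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [u - factor * v for u, v in zip(aug[r], aug[col])]
    return [[v / aug[i][i] for v in aug[i][n:]] for i in range(n)]


def fit_linear(labels: List[Vector], trajectories: List[Vector]) -> Matrix:
    """Least-squares weights W such that [1, x] @ W approximates y."""
    phi = [[1.0] + x for x in labels]  # prepend a bias term
    d, m = len(phi[0]), len(trajectories[0])
    gram = [[sum(p[i] * p[j] for p in phi) for j in range(d)] for i in range(d)]
    rhs = [[sum(p[i] * y[k] for p, y in zip(phi, trajectories)) for k in range(m)]
           for i in range(d)]
    return solve(gram, rhs)  # normal equations


def predict(weights: Matrix, x: Vector) -> Vector:
    """Evaluate the fitted model at label x."""
    phi = [1.0] + x
    return [sum(p * w[k] for p, w in zip(phi, weights)) for k in range(len(weights[0]))]


# Toy data: 2-D goal -> 3-number "trajectory", generated by an exactly linear rule.
labels = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]]
trajectories = [[1 + 2 * a, 3 * b, a - b] for a, b in labels]
weights = fit_linear(labels, trajectories)
estimate = predict(weights, [0.3, 0.7])  # close to [1.6, 2.1, -0.4]
```

In a real moveme model the output vector would hold every marker coordinate at every frame, so `trajectories` would have on the order of $10^4$ columns while the label stays low-dimensional.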



8. STYLES 

Thus far we have proposed to parameterize movemes by action type (walk, reach, look, 
etc.) and by goal. Johansson noticed that, from his displays, one can tell the age, sex, etc. 
of the person performing a motion. Therefore we need to add a new set of parameters, 
which we call “style” parameters. Now, every time new parameters are added the num- 
ber of samples necessary to estimate a moveme increases. We postulate that the style and 
goal parameters are orthogonal, i.e., that we may estimate them independently. More 
precisely: 

Suppose $F : x \to y$ is a function which maps a given label $x$ to a moveme trajectory $y$.
Suppose also that samples $\{x_i, y_i\}$ of the moveme performed with a new style are available.
Then we can calculate the residual motions due to the new style as $\{r_i = y_i - F(x_i)\}$.

Using the residual motions and the sample labels, $\{x_i, r_i\}$, a residual function $R : x \to y$
can be learned. Then, to synthesize a new trajectory with the new mood or style, the
original function $F$ and the residual function $R$ can be combined. In fact, a modulating
parameter, $\alpha$, can be used to control how much of the new mood to use:



$$y_{\text{new-mood}} = F(x) + \alpha R(x) \qquad (1)$$

This is similar to the recent paper by Blanz and Vetter on face morphing [9], where sample
faces are linearly combined to produce new faces, except that here entire
trajectories are morphed.
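
The residual construction of equation (1) can be sketched for a single output coordinate. The nominal model `nominal`, the toy style samples, and the choice of a straight-line fit for the residual are all assumptions made for illustration.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def nominal(x: float) -> float:
    """Stand-in for the learned nominal moveme model F (one output coordinate)."""
    return 2.0 * x + 1.0


# A few samples of the same moveme performed in the new style (toy numbers).
style_x = [0.0, 0.5, 1.0]
style_y = [1.3, 2.35, 3.4]

# Residuals r_i = y_i - F(x_i), then a small model R fitted to the pairs (x_i, r_i).
residuals = [y - nominal(x) for x, y in zip(style_x, style_y)]
a, b = fit_line(style_x, residuals)


def synthesize(x: float, alpha: float) -> float:
    """Equation (1): F(x) + alpha * R(x); alpha scales how much style is applied."""
    return nominal(x) + alpha * (a + b * x)
```

Setting `alpha` to 0 reproduces the nominal motion and setting it to 1 applies the full learned style; intermediate values blend the two.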

Experimentally, it was verified that this technique works well: a new style can be
learned with fewer examples. Intuitively, one can argue that a mood or style typically
can be viewed as a ‘lowpass’ perturbation of the overall motion. While the nominal
function needs to be learned with many examples to ensure that a proper moveme
trajectory is generated throughout the input label’s domain, the new mood’s residual is
a small signal that is not as complicated (does not change as much as a function of the
label). Thus, fewer examples are needed to encode it. Furthermore, perceptually, given
that the overall moveme is correct, ‘errors’ in mood or style are difficult to perceive.



9. DETAILS OF MOVEMES

In this section we describe in detail the movemes that were analyzed. For each, we describe
the moveme, the dataset acquired, and the choice of goal parameterization.

9.1 2-D REACH 

Dataset: Figure 1a shows some snapshots of the end poses of a reaching moveme. The data
was acquired with a 3-D mocap system using a single camera. The actor stood facing the
camera, with arms by his sides, and then reached to a location around him, as if picking
an apple from a tree (or from the ground). In order to make the dataset consistent, the
actor always stood in the same initial position at the onset of each sample, and returned to
that pose at the end of each sample. In all, 91 samples of reach movemes were collected;
the goal locations are shown in figure 1b. The actor reached towards 12 regularly spaced
directions, near the limit of his reach range and half way. The duration of each reach
moveme sample trajectory varied from 90 to 117 frames (almost 2 seconds), and they
were all uniformly re-sampled to be 117 frames long.







Figure 1. (a): Sample reach poses for learning a ‘reach moveme’. Starting from the rest position (center),
the subject reached with his right hand to various locations around him. In (b), the 91 sample reach locations
are plotted. The motion captured trajectories are used to learn a model of the elementary reach ‘moveme’,
parameterized by the desired reach location (the ‘goal’ of the moveme).

Moveme parameterization: In the reach dataset, since the actor always started from
the same initial condition and ended in the same final position, no state parameters are
needed. The goal parameters are the 2-D screen coordinates of the position reached. Figure
1b shows the locations of the 91 sample reaches.

9.2 3-D REACH 

Dataset: This dataset is similar to that of the 2-D reach, except that the data was captured 
in 3-D. The actor started from the rest position (standing facing forwards, arms at side) and 
reached to various locations in 3-D, returning to the rest position after each reach. 

Moveme parameterization: As in the 2-D case, since the actor always started from 
the same initial pose, no state parameters are used, and the goal parameters are the co- 
ordinates of the 3-D reach location. 



9.3 2-D DRAW 

Dataset: Figure 2a shows some examples of 2-D drawing data. To perform this
moveme, the actor stood and faced the camera. Starting at a random location, the
actor performed simple strokes, as if drawing on a chalk board, consisting of either
straight lines or curved arcs (but never curves with inflection of curvature). In all, 378
samples of the draw moveme were collected, with the strokes varying in position in
space, size, and amount of curvature. The draw moveme was chosen for analysis
because it is a good candidate for exploring how to compose movemes to create more
complicated actions, as described in section 11 on concatenation. For example, writing
on the board can be decomposed into a sequence of elemental draw movemes.




Figure 2. (a): Sample drawing strokes of a ‘draw’ moveme. Both straight and curved strokes at various scales were
drawn, for a total of 378 samples. The motion of the entire body labeled with a parameterized encoding of the
shape of the stroke is used to learn a draw moveme model. (b): A typical sampled stroke, and the corresponding
cubic interpolation defined by an 8-dimensional stroke label. Since the sample strokes were constrained to not have
inflection of curvature, a simple cubic interpolant represents the dataset trajectories very well.

Moveme parameterization: The goal of the 2-D draw moveme is to draw a simple
stroke. Thus, the goal parameters need to describe the path of the drawing hand. To
represent the path in a compact form, it is represented by the coefficients of a cubic interpolant:

$$x(t) = c_0 + c_1 t + c_2 t^2 + c_3 t^3 \qquad (2)$$

where $x(t)$ is the 2-D position of the hand, each $c_k$ is a 2-D coefficient vector (eight numbers in all), and
time has been rescaled to be $t = 0$ at the beginning of motion and $t = 1$ at the end.
Finding the coefficients to solve the above equations is a minimization problem, and does
not guarantee a perfect fit. However, since we want to have an accurate control/estimation
of the initial and final state of the body for the purpose of composing movemes, it is
important to represent the starting and ending position accurately (i.e., equations 2 must
be hard constraints for $t = 0$ and $t = 1$, and soft constraints for $0 < t_1 < \dots < t_n < 1$). Let

$$P(t) = [\,1 \;\; t \;\; t^2 \;\; t^3\,]$$

and let $C^*$ be a solution of

$$\begin{bmatrix} P(0) \\ P(1) \end{bmatrix} C^* = \begin{bmatrix} x(0) \\ x(1) \end{bmatrix}$$

Also let $C'_1$, $C'_2$ be a basis of the null space of $[P(0)^T \; P(1)^T]^T$, i.e. $[P(0)^T \; P(1)^T]^T C'_i = 0$. Then a solution
satisfying the hard constraints and minimizing the error in the soft constraints can be
found by solving for $w_1$ and $w_2$ in the least squares problem:

$$X - P C^* = [\,P C'_1 \;\; P C'_2\,] \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \qquad (6)$$

where $X = [x(t_1); \dots; x(t_n)]$ and $P = [P(t_1); \dots; P(t_n)]$.

Figure 2b shows a sample fit of a stroke (it is difficult to visualize the resulting coeffi- 
cients themselves since they lie in an 8-dimensional space). 
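
The constrained fit can be sketched for one coordinate of the stroke; applying it to both coordinates yields the eight numbers of the stroke label. The particular solution (a straight line through the end points) and the null-space basis ($t^2 - t$ and $t^3 - t$) are one convenient choice among many, and `fit_cubic_fixed_ends` is a hypothetical name.

```python
from typing import List, Sequence


def fit_cubic_fixed_ends(ts: Sequence[float], xs: Sequence[float]) -> List[float]:
    """Cubic c0 + c1 t + c2 t^2 + c3 t^3 that interpolates the first and last
    samples exactly (t = 0 and t = 1) and fits the others by least squares."""
    x0, x1 = xs[0], xs[-1]
    # Particular solution C*: the straight line through both end points.
    c_star = [x0, x1 - x0, 0.0, 0.0]
    # Null-space basis of the end constraints: t^2 - t and t^3 - t vanish at 0 and 1.
    g1 = [t * t - t for t in ts]
    g2 = [t ** 3 - t for t in ts]
    resid = [x - (c_star[0] + c_star[1] * t) for t, x in zip(ts, xs)]
    # 2x2 normal equations for the weights w1, w2 of the null-space directions.
    a11 = sum(u * u for u in g1)
    a12 = sum(u * v for u, v in zip(g1, g2))
    a22 = sum(v * v for v in g2)
    b1 = sum(u * r for u, r in zip(g1, resid))
    b2 = sum(v * r for v, r in zip(g2, resid))
    det = a11 * a22 - a12 * a12
    w1 = (b1 * a22 - b2 * a12) / det
    w2 = (a11 * b2 - a12 * b1) / det
    # C = C* + w1 * [0, -1, 1, 0] + w2 * [0, -1, 0, 1]
    return [c_star[0], c_star[1] - w1 - w2, w1, w2]


ts = [i / 10 for i in range(11)]
xs = [1 + 2 * t - 3 * t ** 2 + 4 * t ** 3 for t in ts]  # one coordinate of a stroke
coeffs = fit_cubic_fixed_ends(ts, xs)                    # close to [1, 2, -3, 4]
```

Because the end points are matched exactly, consecutive strokes fitted this way join without a gap, which is the property needed for composing movemes.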



9.4 3-D WALK 

Dataset: The subject walked back and forth along straight and curved paths, starting
and stopping, with a consistent style of walking. The aim was to capture samples of
walking with as much variability as possible in terms of length of step and curvature of
path. Due to a limitation of the mocap system (3x4 m capture space and only 4 cameras),
the actor had to always face the same general direction (+/- 30 degrees) and was not free
to walk around naturally, which would have been ideal. Instead, short sequences of walking
along a straight or curvy path for 5 steps, followed by walking back to the starting
position, were repeated over and over. After segmentation and labeling, 124 samples of
stepping with the left foot and 119 samples with the right foot were obtained, to be used
to learn two separate movemes (stepping with each foot). The samples were all resampled
in time to be 45 frames in duration (0.75 seconds).

Moveme parameterization: The 3-D walking dataset is more complicated than the 2-D
reach or draw datasets, in the sense that it requires a parameterization of the state of
the actor at the beginning of the step. Also, it is not immediately obvious how to
parameterize a step.

One possible parameterization is to use the position of the feet at the start and end of the
step, and to postulate that the entire body motion follows from that. Specifically, the reference
frame of each step is taken as the (initial) position of the stationary (support) foot, with
the x-axis along the foot direction. The state of the character is defined by the initial
position of the moving foot, and the goal is defined by its final position. Figure 3 shows
three sample start and end poses for stepping with the left foot, and figure 4 shows the
labels for all the examples of left- and right-footed steps.
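
The labeling of a step can be sketched as a change of coordinates into the support-foot frame. The sketch assumes that positions are first projected onto the ground plane and that the foot direction is given by the ankle and foot-tip markers; `to_support_frame` and `step_label` are hypothetical names.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def to_support_frame(p: Point, ankle: Point, toe: Point) -> Point:
    """Express ground-plane point p in a frame with origin at the support ankle
    and x-axis pointing from the ankle to the foot tip."""
    heading = math.atan2(toe[1] - ankle[1], toe[0] - ankle[0])
    dx, dy = p[0] - ankle[0], p[1] - ankle[1]
    c, s = math.cos(heading), math.sin(heading)
    return (c * dx + s * dy, -s * dx + c * dy)


def step_label(start: Point, end: Point, ankle: Point, toe: Point) -> Dict[str, Point]:
    """State = initial moving-foot position, goal = final one, both in the support frame."""
    return {"state": to_support_frame(start, ankle, toe),
            "goal": to_support_frame(end, ankle, toe)}


# Support foot at (100, 50) cm pointing along +y; the moving foot swings from behind to in front.
label = step_label(start=(120.0, 20.0), end=(120.0, 90.0),
                   ankle=(100.0, 50.0), toe=(100.0, 70.0))
```

In this toy example the moving foot starts 30 cm behind and ends 40 cm ahead of the support ankle, 20 cm to its right, independently of where in the capture volume the step took place.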



Figure 3. Three sample start and end poses for stepping with the left foot, with left ankle trajectory traced out
in 3-D. A total of 124 samples were used to learn the ‘walk’ moveme.

Figure 4. Parameterization of the ‘walk’ moveme involves defining the initial and final position of the moving foot
with respect to the reference frame of the stationary pivot foot. The pivot foot (marked by the foot icon) has the
ankle marker at the coordinate origin, with the foot-tip marker aligned along the x-axis. Circles denote sample start
positions of the stepping foot, and diamonds denote the end positions. Trajectories on top (red) are those of the left
foot ankle marker (left-footed steps), and those on bottom (blue) are those for the right foot. Units in cm.



9.5 3-D HAPPY WALK

Dataset: Examples of walking in a different style - ‘happy’ (swaying back and forth,
with a little bounce in the step) - were acquired. The aim was to use this dataset to learn
an incremental model of a new style of walking based on the original walking moveme
(as described in section 8 on learning styles). Thus, only 16 samples of happy walking
(16 steps with left and right foot each) were acquired.

Moveme parameterization: In this dataset, the labeling of the samples is done exactly
as in the 3-D walking case. The level of happiness is assumed to be constant (the actor
needs to perform the moveme samples consistently). During synthesis (after the appropriate
models have been learned), the happy mood can be combined with nominal
walking. Figure 5 shows the labels for the dataset.




Figure 5. Plot of foot start and stop positions (in cm) and trajectories for walking data acquired with a new style: 
happy walking. Learning of styles is done based on the original moveme model and requires fewer samples (16 
for stepping with each foot). Plot conventions same as figure 4. 

9.6 3-D SAD WALK, 3-D SEXY WALK 

Dataset: Two additional styles of walking were acquired. In one, the actor walked as if
he were very sad: head hung low, dragging his feet, and with little arm movement. In the
other, the actor walked ‘sexy’, with exaggerated hip swing and arm motion.

Moveme parameterization: As with the happy style, the motion is parameterized with 
the same type of parameters as the original 3-D walk moveme data. 

9.7 3-D STEP-OVER 

Dataset: Examples of stepping over an object were recorded. The variability of the samples
included the size of the object (roughly three heights, 15 cm, 30 cm, and 50 cm, were
used). Because an actual physical object would interfere with the motion capture process,
an imaginary object of variable height and length was used. Also, the angle of attack and
curvature of the walk path during the stepping-over were varied. In all, 29 steps with the left
foot and 34 with the right foot were captured. Figure 6 shows the start and end pose of two
samples of stepping-over with the left foot, displaying the 3-D trajectory of the left ankle.

Figure 6. Two representatives of the stepping-over moveme, with the trajectory of the stepping ankle traced out
in 3-D.

Moveme parameterization: For the ‘stepping-over’ moveme, not only do the initial
and final positions of the stepping foot have to be specified, but the height of the object
(actually the height to raise the foot to) is also part of the goal parameters. This value was
extracted from the examples as the maximum height that the moving foot achieved. See figure
7 for the labels of all the step-over examples. Note that for stepping over an object, there
are four possible movemes involved: the lead step with the left or the right foot, and the
follow-through step with the other foot. The dynamics of the movement of the lead foot is quite
different from that of the follow-through foot, and so those movemes should be learned
separately. Due to a limitation in the mocap system (the light-bulb suit limited the range of
motion of the knees), it was not possible to capture examples of the follow-through moveme.




Figure 7. Plot of foot start and stop positions (in cm) and trajectories for the stepping-over moveme. Note that
these steps are much longer than during normal or happy walking. The moveme is parameterized not only with
the foot start and stop positions, but also by the height of the object being stepped over. Plot conventions same
as for figure 4.



9.8 3-D LOOK

Dataset: Starting from the same initial position, looking straight ahead, the actor turned his 
head, neck and torso (entire body, in fact) to look in various directions; looking straight up, 
down to the floor, to the left, to the right. The visual hemi-field was sampled approximately 
every 20 degrees (vertically and horizontally) to generate the samples. In total, 34 samples 
were acquired. 





Figure 8. The labels for the 34 acquired samples of the look moveme. X-axis is azimuth of head, Y-axis is el- 
evation (angles in radians), where origin (open circle) corresponds to the subject looking straight ahead. 



Moveme parameterization: In the 3-D look dataset, the actor always started from and
returned to the same initial position, so it is not necessary to encode any state parameters.
The goal parameter is the gaze direction (azimuth and elevation), computed as the
azimuth and elevation of the vector from the centroid of the two ear markers to the
forehead marker. Figure 8 shows the labels for the samples.
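
The gaze-direction label described above can be computed from three marker positions. The following is a minimal sketch only: the text does not state the axes of the motion capture frame, so the convention used here (x forward, y to the subject's left, z up) and the function name are assumptions made for illustration.

```python
import math

def gaze_direction(left_ear, right_ear, forehead):
    """Return (azimuth, elevation) in radians of the ear-centroid-to-forehead vector.

    Assumed axes (not given in the text): x forward, y to the left, z up.
    """
    # Centroid of the two ear markers.
    cx = (left_ear[0] + right_ear[0]) / 2.0
    cy = (left_ear[1] + right_ear[1]) / 2.0
    cz = (left_ear[2] + right_ear[2]) / 2.0
    # Vector from the centroid to the forehead marker.
    vx, vy, vz = forehead[0] - cx, forehead[1] - cy, forehead[2] - cz
    azimuth = math.atan2(vy, vx)                    # rotation about the vertical axis
    elevation = math.atan2(vz, math.hypot(vx, vy))  # angle above the horizontal plane
    return azimuth, elevation

# Subject looking straight ahead: the forehead lies directly in front of the ear centroid.
straight_ahead = gaze_direction((0.0, 0.08, 1.6), (0.0, -0.08, 1.6), (0.1, 0.0, 1.6))
```

With this convention the straight-ahead pose maps to the origin of the label space, as in Figure 8.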

9.9 3-D RUN 

Dataset: Data to learn a run moveme was captured by placing an actor on a treadmill 
and having him run at various speeds, from standstill to all-out sprint (or as close to a 
sprint as was safe). Ideally, one would capture data of a subject running along various 
straight and curved paths, but this was not possible due to the small capture volume of 
our motion capture system. Like the walk moveme, data was separated into left-footed 
and right-footed strides to learn two (half-cycle) run movemes. 

Moveme parameterization: Since the actor was facing forward on the treadmill at all
times, the only goal parameter definable is the speed of the running, measured as the
average speed of the treadmill during each stride (with the length of the stride varying as
a function of speed).
10. PERFORMANCE

In this section we study the quality of the different moveme models using three diag- 
nostics. First, the root mean square error can be used as a numerical measure of quality. 
Second, a perceptual evaluation can be made by using the models to synthesize 
movemes and having naive observers rate them. Finally, the quality of the models when 
they are used to synthesize movemes outside of the normal input range is studied by 
qualitative visual comparison. 



10.1 RMS PERFORMANCE 

We use the normalized root mean square error to quantify the quality of a moveme 
model. The normalized root mean square error is the RMS error of the model divided by 
RMS error of the simplest of all models, the constant mean value model. A value of 0 in- 
dicates a perfect model with no error, and a value of 1 indicates the model is as inaccu- 
rate as the mean value model. 
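
The definition above can be sketched in a few lines. This is an illustrative sketch, not the authors' code: it assumes that targets and predictions have been flattened into plain lists of numbers and that the mean value model is a single constant, whereas the paper's outputs are marker coordinates over many frames.

```python
import math

def rms(errors):
    """Root mean square of a flat list of numbers."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def normalized_rms_error(targets, predictions):
    """RMS error of the predictions divided by the RMS error of the constant mean model."""
    mean = sum(targets) / len(targets)
    model_err = rms([t - p for t, p in zip(targets, predictions)])
    baseline_err = rms([t - mean for t in targets])
    return model_err / baseline_err

targets = [1.0, 2.0, 3.0, 4.0]
perfect = normalized_rms_error(targets, targets)       # a model with no error
mean_only = normalized_rms_error(targets, [2.5] * 4)   # the mean value model itself
```

The two toy cases reproduce the endpoints of the scale: 0 for a perfect model and 1 for the mean value model.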

The first two tables (Tables 10.1, 10.2) show the normalized error levels of the linear, 
quadratic, and radial basis function models for the 2-D reach and 2-D draw moveme, re- 
spectively. The first column shows the error when the entire dataset was used to learn the 
model. The second column shows the cross-validation error, where some of the data 
samples were not used for building the model, but instead were used only to compute the 
model error. For the reach moveme, increased model order improves the accuracy of
representation, and only minimal over-fitting is occurring. Note that for the same size of
model the polynomial models are significantly more accurate than the RBF models. With
the draw moveme, since the input (parameter) space is large (8-dimensional) and not
many samples are available (378), the effects of over-training are clearly visible: the
cross-validation error of the quadratic model (0.344) is about a third larger than that of the
linear model (0.258), or almost double in terms of squared error. The radial basis function
model has a cross-validation error more than three times as large as the linear model (more
than 10 times in terms of squared error), since it is very difficult to cover a high-dimensional space.

From the results above we conclude that polynomial models outperform radial basis 
function models, and so concentrate further analysis on those models only. In table 10.3 
we indicate the representational accuracy of linear and quadratic models for the differ- 
ent 3-D movemes. Note that in many instances the quadratic model does not improve 
significantly over the linear. For styles, only the results for linear models are shown since
there are not enough data samples to create a quadratic model.

10.2 PERCEPTUAL EVALUATION OF MOVEME MODELS 

The perceptual quality of the moveme synthesis method was evaluated with formal and 
informal tests with various subjects. 

In the formal tests, subjects were presented with a two-alternative forced-choice paradigm
in which they were asked to distinguish between original and re-synthesized movemes.
Two actions were tested, 2-D reaching, and 2-D drawing. The movemes were presented 
Table 10.1. Comparison of different learning models for the reach moveme. Linear, quadratic, and cubic
interpolants, as well as Radial Basis Function (RBF) networks with 5, 9, and 20 basis functions were used to
create models mapping the 2-D reach location to the motion of the entire body (14 markers in 117 frames).
Increasing the polynomial degree increased the accuracy of the model, seemingly without over-training. For a
similar model size, RBFs generalized less well. Cross-validation values represent normalized RMS error mean ±
standard deviation for 1000 independent iterations of training and testing.



Method                  RMS Error         RMS Error                Model size
                        (all data set)    (2/3 learn, 1/3 test)

Polynomial  Degree=1    0.1357            0.1533 ± 0.0163          3276x3
            Degree=2    0.0677            0.0842 ± 0.0108          3276x6
            Degree=3    0.0460            0.0663 ± 0.0100          3276x10
RBF         N = 5       0.0921            0.1335 ± 0.0145          3276x6
            N = 9       0.0461            0.1011 ± 0.0122          3276x10
            N = 20      0.0314            0.0945 ± 0.0109          3276x21



Table 10.2. Comparison of different learning models for the draw moveme. Due to the large dimension of the 
input space (an 8 dimensional vector is used to parameterize the shape of the stroke to draw), the RBF models 
are approximately three times less accurate than polynomial interpolants, given the same number of model pa- 
rameters. Although the errors of the polynomial interpolants are larger than for the reach moveme models, tra- 
jectories synthesized with these models still retain a realistic appearance. 



Method                  RMS Error         RMS Error                Model size
                        (all data set)    (2/3 learn, 1/3 test)

Polynomial  Degree=1    0.260             0.258 ± 0.10             1680x9
            Degree=2    0.215             0.344 ± 0.08             1680x17
RBF         N = 8       0.749             0.872 ± 0.04             1680x9
            N = 16      0.608             0.831 ± 0.04             1680x17



as moving light displays [18], and each subject viewed 30 stimulus pairs. For each
pair, an original moveme was randomly chosen from the samples in the motion capture
dataset, and the corresponding moveme label was used to create a synthetic moveme.
For the reaching action, a 3rd order polynomial model was used, while for the drawing
action a linear model was used. After the pair was presented (with random order of
appearance between the original and synthesized moveme), the subject chose which
appeared more realistic. Subjects who were unsure were instructed to guess. If the
true and re-synthesized movemes were completely indistinguishable, subjects should
perform the task at chance level (50% correct responses). The results in table 10.4 show
that indeed our subjects were unable to distinguish between real and synthetic movemes.
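
The chance-level reference values quoted in the caption of Table 10.4 follow from treating each of the 30 responses as an independent fair coin toss. A short worked check, with variable names chosen here for illustration:

```python
import math

n_pairs = 30      # stimulus pairs seen by each subject
p_chance = 0.5    # probability of a correct guess if the two are indistinguishable

mean_percent = 100.0 * p_chance
# Standard deviation of the fraction correct over n independent guesses.
std_percent = 100.0 * math.sqrt(p_chance * (1.0 - p_chance) / n_pairs)
```

Rounded to two decimals, `std_percent` is 9.13, the figure given in the caption.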

Various informal tests were also conducted. During the development of the different 
moveme model techniques, perceptual tests were always used to assess model quality. 
Perhaps the most significant perceptual tests were those where an interactive demo that 
used the moveme models to synthesize animation in real-time was shown to profession- 
Table 10.3. Summary of the size of 3-D moveme data and models. The first column lists the number of sample
motions acquired for each moveme. The second column is the number of goal and/or state parameters used to
parameterize the moveme. The ‘RMS Error’ columns denote the normalized RMS error when linear and
quadratic models are used to represent the movemes. For learning of styles, only the linear model can be applied
due to the small number of sample motions. ‘Compression’ indicates the ratio of the size of the raw data of a
sample trajectory divided by the number of label parameters that specify the movement.



Moveme                    N. Sample    N. Labels       RMS Error    RMS Error      Compression
                          Motions      (Parameters)    (Linear)     (Quadratic)

3-D Walk - Left Step      124          4               0.332        0.314          600
3-D Walk - Right Step     119          4               0.281        0.267          600
3-D Look                  32           2               0.531        0.385          3160
3-D Run                   56           1               0.675        0.627          2400
3-D Reach                 33           3               0.393        0.270          6480
Step-Over - Left Step     29           5               0.427        -              480
Step-Over - Right Step    34           5               0.451        -              480
Happy - Left Step         16           4               0.754        -              600
Happy - Right Step        16           4               0.726        -              600
Sad - Left Step           19           4               0.743        -              600
Sad - Right Step          18           4               0.735        -              600
Sexy - Left Step          22           4               0.709        -              600
Sexy - Right Step         25           4               0.697        -              600




Figure 9. A real-time, interactive synthetic character performing look, reach, walk, and step-over movemes 
based on the high-level control of a user. The resulting motion is not only extremely versatile due to the use of 
goal and style parameters, but is also very realistic and fluid. 








LUIS GONCALVES, ENRICO DI BERNARDO AND PIETRO PERONA 



al game developers at a Game Developer’s Conference (September 98) and in private 
meetings (August 99). They were unanimously impressed by the quality of the anima- 
tion, many claiming that they had never seen such high quality real-time animation. Fig- 
ure 9 shows some screen shots of a synthetic character performing various movemes. 



10.3 EVALUATION OF MODEL EXTRAPOLATION 

When synthesizing a new moveme for animation purposes (and even during recogni- 
tion and perception), it is not always guaranteed that the moveme parameters will fall 
within the convex hull of the samples used to learn the model. To see how different models
fare, the same (extrapolating) parameters are applied to the various models, and the
resulting movemes are inspected visually. As expected, the linear model is much better
behaved than the quadratic model and the radial basis function model. The radial basis 
function model simply fails to represent motion output beyond the convex hull, result- 
ing in poor synthesis. The quadratic model “explodes” due to the large effect of the quad- 
ratic terms beyond the convex hull. 

Thus the empirical conclusion from all the different evaluations of the moveme models
is that linear models offer the best compromise between accuracy of representation,
the number of examples required, and extrapolation ability.



Table 10.4. Perceptual discriminability between original motions and reconstructions for the reach and the 
draw movemes. Subjects were presented 30 pairs of real and synthetic motions in random pair-wise order and 
were asked to determine whether the first or the second was the original motion. The mean and standard de- 
viation match quite closely to the theoretical values for 30 i.i.d. coin tosses (mean of 50%, standard deviation 
of 9.13%), indicating that subjects found it difficult to distinguish between the two.



Subject      Reach        Draw
             % Correct    % Correct

B.P.         36.7         50.0
L.G.         60.0         50.0
J.B.         33.3         53.3
Y.S.         46.7         -
Y.K.         63.3         -
RM.          60.0         -
P.P.         46.6         -
M.M.         -            46.7
E.D.         -            46.7
Mean         49.51        48.28
St. dev.     11.93        3.59
11. CONCATENATION

If movemes are to be used as elemental motions, it should be possible to concatenate 
them together to form more complex actions. This is possible with our definition of 
movemes through careful design. To ensure good concatenation, the body pose at the end 
of one moveme needs to match the initial pose of the next moveme. For the look and
reach movemes, the initial (and final) pose is the ‘rest’ pose: standing, looking forward,
with arms at the side. These movemes are also cyclical, in the sense that the full motion
is represented; both the forward moving part of a reach (or look) and the return motion 
are represented. Thus they can easily be concatenated. The step moveme is also cyclical, 
so that it can be concatenated with itself. However, in a typical stride the body pose will 
not match up with the rest pose of the reach and look movemes. To ensure that concate- 
nation of the walk moveme with the reach and/or look moveme is possible, the set of 
sample trajectories of the walk moveme included examples of starting to walk, and stop- 
ping, where (respectively) the subject started from the rest pose and ended in the rest 
pose. 
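
The pose-matching condition can be expressed as a simple check. The sketch below is hypothetical: the representation of a moveme as a list of frames, each a list of (x, y, z) marker positions in cm, the function names, and the 2 cm tolerance are all assumptions made for illustration and do not come from the text.

```python
import math

def poses_match(pose_a, pose_b, tolerance_cm=2.0):
    """True if every marker in pose_a lies within tolerance_cm of the same marker in pose_b.

    A pose is a list of (x, y, z) marker positions; the tolerance is an assumed value.
    """
    return all(math.dist(p, q) <= tolerance_cm for p, q in zip(pose_a, pose_b))

def can_concatenate(first_moveme, second_moveme, tolerance_cm=2.0):
    """Compare the last frame of the first moveme with the first frame of the second."""
    return poses_match(first_moveme[-1], second_moveme[0], tolerance_cm)

# Toy two-marker example: a reach starts and ends in the rest pose,
# while a mid-stride step begins with one marker displaced by 20 cm.
rest = [(0.0, 0.0, 0.0), (10.0, 0.0, 150.0)]
reach = [rest, [(0.0, 0.0, 0.0), (40.0, 0.0, 140.0)], rest]
mid_stride = [[(0.0, 0.0, 0.0), (30.0, 0.0, 150.0)], rest]

reach_then_reach = can_concatenate(reach, reach)
reach_then_stride = can_concatenate(reach, mid_stride)
```

In this toy case the reach can follow itself, while the mid-stride step cannot follow the reach, which is why the walk dataset needs start-walking and stop-walking examples that begin or end in the rest pose.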

In the limit, one can hypothesize that the models for all movemes have state parame- 
ters that encode the initial pose of the body (or at least the pose of the relevant body 
parts) so that a properly concatenated moveme can always be generated. 



12. DISCUSSION AND CONCLUSIONS 

We have proposed to model human motion by concatenating discrete elements, the 
‘movemes’, which are learned by observing people move. Our argument in favor of such a
model stems from the observation that common human motions are stereotypical and de- 
termined by somewhat arbitrary patterns that are generated by the brain, rather than be- 
ing determined by the mechanics of the body. Suggestions in this direction come both 
from the motor control literature and from the physiology and psychology of perception. 

We have explored six movemes and highlighted their properties. We demonstrated that 
movemes greatly compress motion information: just a few goal and style parameters en- 
code human motions which are characterized by thousands of position parameters. All 
the movemes that we studied had a natural parameterization in terms of the goal of a spe- 
cific simple action that is associated with the moveme. Simple linear or quadratic maps 
were found to represent each moveme very well, to the point that human subjects are un- 
able to tell motions produced by human actors apart from motions produced by our mod- 
els. Such maps are specified by few parameters and thus require few training examples 
to be learned. 

More questions are opened rather than settled by postulating the existence of movemes.
First of all: how many movemes might there be, and how might one go about building a
catalogue? We do not have an answer to this question: we may only make educated guesses
by comparing with phonemes and by counting action verbs in the dictionary; the
estimates we get are on the order of 100 movemes. A related practical
question is whether it would be possible to define subsets of movemes that are sufficient
for visual perception and animation of human motion in restricted situations: e.g. urban
street scenes, mountain hiking, playing hide-and-seek in woods.

A second class of questions regards the practical aspects of using moveme-based mod- 
els for animation and vision. We have successfully animated characters using movemes 
and combinations of movemes, and we have attributed different ‘styles’ to these characters
as well. Rose et al. [24] also describe an animation system based on ‘verbs’ and ‘adverbs’,
which is similar to our movemes with goal and style parameters. It is still unclear,
however, whether superimposing and concatenating new types of movemes will always 
be natural and possible, or whether there will be a combinatorial explosion of boundary 
conditions to be handled in combining different movemes. 

Where does all this place us in interpreting biological motion perception? Movemes ap- 
pear to be a natural and rich representation which the brain might employ in perceiving 
biological motion. However, more work needs to be done to build a complete moveme- 
based human motion tracker/recognizer that might be compared with psychophysical 
measurements [18]. A germ of such a system may be recognized in Bregler and Malik’s 
motion tracker [13], which switches between four linear time-invariant dynamical systems
representing different phases of a walking step. Black et al. [25] also describe a
moveme-like approach where parameterized ‘temporal primitives’ of cyclical motions are
estimated from motion capture data, and the models are then used to track the motion of 
a subject in 3-D. Finally, Mataric [20] has developed a perception-action model based on 
a compact set of ‘basis behaviors’ that enables a humanoid robot to both perceive human 
action and imitate it. 

A third class of questions has to do with the high-level aspects of the analysis of hu- 
man motion. The final aim in many applications is to ‘understand’ human activity from 
images, or synthesize motion from high-level descriptions. This is done at a more ab- 
stract level than movemes: we are ultimately interested in descriptions such as ‘she is 
cautiously pruning her rose-bushes’ rather than ‘... steps forward, then reaches for stem 
while looking at stem ...’. How well will we be able to map sequences of movemes to 
high-level descriptions of activities and vice-versa? 



A. MOTION MODELS 
A.1 Linear models

The simplest model one can use is a global linear model. If $\{x^s\}$ is the set of all label
vectors (written as column vectors) for a particular dataset with $N_s$ samples, we can form
an augmented input matrix $X$:

$$X = \begin{bmatrix} 1 & \cdots & 1 \\ x^1 & \cdots & x^{N_s} \end{bmatrix}.$$

$X$ is the matrix of sample inputs, augmented by a row of 1's. This row is used to learn
the constant bias term in the model.

Likewise, if $\{y^s\}$ represents all the sample movemes, where for sample $s$ the column
vector $y^s$ consists of the stacked up coordinates of all the markers throughout all the
frames of the moveme, then we can form the sample output matrix $Y$:

$$Y = \begin{bmatrix} y^1 & \cdots & y^{N_s} \end{bmatrix}.$$

Now the best linear model can be found by solving for $L$ in the least squares problem
$Y = LX$, and this is easily done as $L = Y X' (X X')^{-1}$.

Then our map is $y = f(x) = Y X' (X X')^{-1} \begin{bmatrix} 1 \\ x \end{bmatrix}$, where the label $x$ is
augmented by a leading 1 in the same way as the columns of $X$.
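
The least squares solution above can be sketched with plain Python lists. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`transpose`, `matmul`, `inverse`, `fit_linear`, `predict`) and the toy data are invented here, and a small Gauss-Jordan inverse stands in for whatever numerical routine was actually used.

```python
def transpose(a):
    return [list(row) for row in zip(*a)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def inverse(m):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_linear(labels, outputs):
    """Return L = Y X' (X X')^-1, where the columns of X are the labels with a leading 1."""
    X = transpose([[1.0] + list(x) for x in labels])   # (d + 1) x Ns, first row all ones
    Y = transpose([list(y) for y in outputs])          # m x Ns
    Xt = transpose(X)
    return matmul(matmul(Y, Xt), inverse(matmul(X, Xt)))

def predict(L, x):
    """Apply the learned map to one label, augmented by a leading 1."""
    augmented = [[1.0]] + [[v] for v in x]
    return [row[0] for row in matmul(L, augmented)]

# Toy data generated by a known linear map of a 2-D label (illustrative only).
labels = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
outputs = [(1 + 2 * a - b, 3 - a + 0.5 * b) for a, b in labels]
L = fit_linear(labels, outputs)
```

On noise-free data such as this, the fit recovers the generating coefficients, with the first column of `L` holding the bias terms.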

A.2 Higher-order polynomial models

The next obvious choice for a model, then, is to include higher order terms in the mul- 
tidimensional polynomial interpolant. One can learn a global quadratic model by adding 
additional rows to the input matrix X corresponding to the pairwise products of individ- 
ual label parameters. For a 2-dimensional label, three such products can be formed; with 
a 4-dimensional label there are 10. 

The process can be formalized by defining $N_b$ polynomial basis functions, with the $n$-th
function $\Phi_n(x)$ defined as

$$\Phi_n(x) = \prod_{i=1}^{N_d} x_i^{c_{ni}} \qquad (7)$$

where $x$ is a sample moveme label of dimension $N_d$, $x_i$ is the $i$-th component of the label
vector, and $c_{ni}$ is the power to which the $i$-th label component is raised in the $n$-th basis
function. For the constant basis function, all $c$'s would be zero; for linear basis functions,
only one $c$ is 1 and all others zero; and for quadratic basis functions the sum of the
exponents has to equal 2 (either one is 2 and the others zero, or two of them are 1).
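
The exponent patterns just described can be enumerated directly. The sketch below is illustrative (function names are assumptions): it lists the exponent vectors of all constant, linear and quadratic basis functions for a label of a given dimension, evaluates each basis function as a product of powers, and counts the pairwise products.

```python
from itertools import combinations_with_replacement

def quadratic_exponents(n_dims):
    """Exponent vectors for all monomials of total degree 0, 1 and 2."""
    exponents = []
    for degree in range(3):
        for combo in combinations_with_replacement(range(n_dims), degree):
            c = [0] * n_dims
            for i in combo:
                c[i] += 1          # each chosen component raises its exponent by one
            exponents.append(tuple(c))
    return exponents

def basis_value(x, c):
    """Evaluate one basis function: the product over i of x_i ** c_i."""
    value = 1.0
    for xi, ci in zip(x, c):
        value *= xi ** ci
    return value

def n_pairwise_products(n_dims):
    """Number of quadratic terms, i.e. exponent vectors whose entries sum to 2."""
    return sum(1 for c in quadratic_exponents(n_dims) if sum(c) == 2)

# Features of the 2-D label (2, 3): 1, x1, x2, x1^2, x1*x2, x2^2.
features = [basis_value((2.0, 3.0), c) for c in quadratic_exponents(2)]
```

For a 2-dimensional label this yields three pairwise products and for a 4-dimensional label ten, in agreement with the counts given above.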

If we denote by $\Phi(X)$ the matrix generated by applying all the basis functions to all the
sample moveme labels,

$$\Phi(X) = \begin{bmatrix} \Phi_1(x^1) & \cdots & \Phi_1(x^{N_s}) \\ \vdots & & \vdots \\ \Phi_{N_b}(x^1) & \cdots & \Phi_{N_b}(x^{N_s}) \end{bmatrix} \qquad (8)$$

then the best global polynomial model can be found by solving for the coefficient matrix
$W$ in $Y = W \Phi(X)$, which is also solved by the least squares pseudo-inverse method.

In principle, with basis functions of higher and higher polynomial degree the func- 
tion approximation can be more and more accurate (if the basis includes all possible 
polynomial terms up to a certain degree, the model is a multi-dimensional Taylor
expansion around the zero vector). However, the model can quickly become unwieldy,
as the number of basis functions (and the size of the coefficient matrix W) grows rapidly
with the degree of the approximation and the dimension of the label (and this in turn
demands the use of more training data to prevent over-fitting).
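
To make the growth concrete: the number of monomials of total degree at most k in d variables is the binomial coefficient C(d + k, k). The short sketch below tabulates this count; the function name is chosen here for illustration.

```python
from math import comb

def n_basis_functions(n_dims, degree):
    """Number of monomials of total degree <= degree in n_dims variables."""
    return comb(n_dims + degree, degree)

# 2-D reach label: basis sizes for degrees 1, 2 and 3.
reach_sizes = [n_basis_functions(2, k) for k in (1, 2, 3)]
# 8-D draw label: a full quadratic and a full cubic basis.
draw_quadratic = n_basis_functions(8, 2)
draw_cubic = n_basis_functions(8, 3)
```

For the 2-D reach label this gives 3, 6 and 10 basis functions, matching the model widths listed in Table 10.1, while a full cubic basis for an 8-dimensional label already needs 165.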

Although it will be shown in the experimental section that global quadratic models im- 
prove the fidelity of the re-synthesis of the sample movemes, it will also be shown that 
the synthesized outputs do not degrade gracefully when the input label goes outside the 
range of the training examples. This is the same behavior as in the simple scalar input 
and output case, where it is known that the higher the polynomial order, the worse the
extrapolation error. 



A.3 Radial basis functions 

A further generalization of the method of learning weights for a set of basis functions is
to use functions other than polynomials as the basis. One
set of functions that has been widely studied is that of radial basis functions [5]. The
basic idea behind radial basis functions is that instead of using global polynomials of high
degree to learn a highly non-linear function, one can use many basis functions with
local support, each of which encodes the local behavior of the system. The $n$-th radial basis
function is defined as:

$$\Phi_n(x) = \exp\left(-(x - \mu_n)' \, \Sigma_n^{-1} (x - \mu_n)\right) \qquad (9)$$

Thus the $n$-th radial basis function is centered at $\mu_n$ in the input space, has value 1 when
$x = \mu_n$, and the value decays to 0 as $x$ moves away from $\mu_n$ at a rate controlled by the
covariance ‘spread’ matrix $\Sigma_n$. As explained in [5], if we associate a radial basis function
with each sample input such that for $\Phi_n$ we let $\mu_n = x^n$ (where $x^n$ is the $n$-th sample input),
then a matrix of weights can be learnt that will interpolate through the samples exactly.
Just as in the global polynomial models, we solve for $W$ in $Y = W \Phi(X)$, where now the
matrix $\Phi(X)$ consists of the values of all the radial basis functions for all the sample
inputs. Furthermore, for an appropriate value of the spread parameters $\Sigma_n$, the model can
be made to interpolate smoothly between samples.
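
Exact interpolation with one basis function per sample can be sketched as follows. The sketch makes several simplifying assumptions that are not in the text: a scalar output, an isotropic spread (the spread matrix is taken to be `spread ** 2` times the identity), toy sample values, and a small hand-written Gaussian elimination in place of a library solver.

```python
import math

def rbf(x, center, spread):
    """Gaussian basis function with an isotropic spread (an assumed simplification)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-d2 / spread ** 2)

def solve(a, b):
    """Solve a w = b for a small square system by Gaussian elimination with pivoting."""
    n = len(a)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (aug[i][n] - s) / aug[i][i]
    return w

def fit_rbf(samples, outputs, spread):
    """One basis function per sample input; weights chosen to reproduce the outputs."""
    phi = [[rbf(x, c, spread) for c in samples] for x in samples]
    return solve(phi, outputs)

def predict_rbf(x, samples, weights, spread):
    return sum(w * rbf(x, c, spread) for w, c in zip(weights, samples))

samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
outputs = [0.0, 1.0, 1.0, 3.0]
weights = fit_rbf(samples, outputs, spread=1.0)
```

Evaluating `predict_rbf` at each sample input returns the corresponding sample output up to rounding, while at a point far from all centers every basis function is close to zero, so the prediction collapses towards zero.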

However, one may not want an exact interpolation through all the samples because a)
one knows that there is noise in the samples, and/or b) it would result in too large a
model (if there were many samples). In this case, one uses far fewer radial basis functions,
and the basis centers $\mu_n$ also have to be learnt, along with the coefficient matrix $W$ and
the spread matrices $\Sigma_n$.

There are many different learning algorithms for optimizing the model parameters.
Because of the highly non-linear nature of the model, it is computationally very intensive to
optimize over all the parameters. One simplification is to replace the spread matrices
with a constant, an a priori specified value that is the same for all basis functions. This
is the form of the model that we experimented with.

Although radial basis function models can provide good results in some instances, they
suffer from three drawbacks that make them mostly unsuitable for learning models of human
motion. First, the number of basis functions required to ‘fill the input space’ grows
exponentially with the number of dimensions of the moveme labels. Second, the basis
functions are placed where the data lies in the input space, but if the input space is not
sampled uniformly, there may be gaps, and then the model is not guaranteed to interpolate
properly within large gaps. Finally, because of the local extent of each basis function,
the model cannot extrapolate very well.
A.4 Feed-forward networks

The widely known method of feed-forward neural networks trained with the back-propagation
scheme [19] can also be used to learn moveme models. A network with two layers
could be used, in which the hidden units compute nonlinear functions $\phi_i(x)$ of the input
label $x$, and the output units compute linear transformations of those hidden variables,
$y_j = \sum_i W_{ji} \phi_i(x)$. The structure of the network
is identical to that of the global polynomial interpolators, or the radial basis function
networks; the only thing that changes is the functional form of the nonlinearities $\phi_i(x)$. In
this type of network, the nonlinearities are sigmoidal activation functions:

$$\phi_i(x) = g(w_i' x + b_i) \qquad (10)$$

where

$$g(a) = \frac{1}{1 + e^{-a}}$$

Figure 10(a) shows the activation function $g(a)$. Near zero, $g$ is linear, until it saturates
at a value of 1 above, and 0 below. In equation 10, $w_i$ and $b_i$ define a separating
hyper-plane in the input space (depicted for a 2-dimensional input space in figure
10(b)). The width of the linear region depends on the magnitude of $w_i$: the smaller the
magnitude, the larger the region.
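
A forward pass through such a two-layer network can be sketched as below. The logistic form of g is an assumption made here (the text only states that g is sigmoidal and saturates at 1 above and 0 below), and the function names and example weights are illustrative.

```python
import math

def g(a):
    """Logistic sigmoid (assumed form): 0.5 at a = 0, tends to 1 above and 0 below."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)            # avoids overflow for very negative a
    return e / (1.0 + e)

def hidden_unit(x, w, b):
    """Output of one hidden unit: g applied to the weighted input plus bias."""
    return g(sum(wi * xi for wi, xi in zip(w, x)) + b)

def network(x, hidden_weights, hidden_biases, output_weights):
    """Two-layer network: sigmoidal hidden units followed by linear output units."""
    phi = [hidden_unit(x, w, b) for w, b in zip(hidden_weights, hidden_biases)]
    return [sum(wji * p for wji, p in zip(row, phi)) for row in output_weights]

# The same input seen by a unit with a small and with a large weight vector.
small_w = hidden_unit((2.0, 0.0), (0.1, 0.0), 0.0)   # still in the near-linear region
large_w = hidden_unit((2.0, 0.0), (5.0, 0.0), 0.0)   # already saturated near 1
```

The two example units illustrate the point about the width of the linear region: with the small weight vector the unit is still close to its mid-point value of 0.5, while with the large one the same input already drives it into saturation.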

When a FFNN is used for pattern classification, any particular output of the network is 
‘high’ (value 1, say) when the input vector belongs to that class. For this type of appli- 
cation, the network produces useful computations because of the saturating property of 
the hidden units. Each hidden unit determines whether an input vector is on one side or 
the other of a hyper-plane boundary in the input space. The combined outputs of sever- 
al hidden units can be used to define intricate boundaries between different regions of 
the input space, each region representing a particular class of inputs. 

When a FFNN is used to learn a continuous function, the behavior is very different. In 
this application, it is the linear region of each hidden unit that is important. Over the lin- 
ear region, a hidden unit provides a linear approximation of the function being learned. 
In the saturation regions, it provides only a constant bias. A function is properly approx- 
imated only if the linear regions of each hidden unit cover the entire range of the input 
space. Because each hidden unit saturates above and below, FFNNs
inherently have difficulty extrapolating; beyond the input range specified by the training
examples, it is likely that all hidden units become saturated. 

There are ways to overcome this difficulty. For example, the input range over which 
extrapolation is desired can be specified a priori, and during the network training procedure
the hidden unit weights $w_i$ can be constrained to be small enough to guarantee that
the linear region of each hidden unit is at least as wide as the desired extrapolation range. 
Another solution might be to use a new type of nonlinearity, which saturates only on one 
side. With such a nonlinearity, each hidden unit becomes a linear interpolant which 
‘turns on’ on just one side of a hyper-plane that divides the input space in two halves. 
Figure 10. (a) The nonlinear sigmoidal activation function that generates hidden layer outputs $\phi_i(x)$; (b) the
partitioning of a 2-D input space into a linear region and two saturation regions. The width of the linear region is
determined by the magnitude of the weight vector w.



Besides the extrapolation deficiency, FFNNs have other characteristics that make them 
less desirable for motion learning. They are very slow to train, require the use of validation 
data to determine when to stop training the network to prevent over-fitting, and the trained 
network needs to have its weights regularized to guarantee that marker outputs are contin- 
uous through time. For these reasons, the use of FFNNs was not further explored. Never- 
theless, the analysis of FFNNs was fruitful, in the sense that it provides a clearer idea of 
the functional properties a good moveme model should have. Namely, the input space 
should be separated into regions, with each region activating a local (linear) approximator. 
Some of the regions must be unbounded, so that the system extrapolates properly. 
FORM CONSTRAINTS IN MOTION INTEGRATION,
SEGMENTATION AND SELECTION 



INTRODUCTION 

Perception is a process by which living organisms extract regularities from the physi- 
cal fluxes of varying physical characteristics in the external world in order to construct 
the stable representations that are needed for recognition, memory formation and the or- 
ganisation of action. The exact nature of the process is still not well understood as the 
type of regularities that are indeed used by sensory systems can be manifold. However, 
perception is not a process by which living organisms reproduce the physical fluxes
so as to build an internal representation identical to its physical counterpart.

One issue then is to understand the kind of physical regularities that are relevant for 
perceiving and recognising events in the outside world. One way to address this question 
is to consider what we do not perceive. For instance, in vision, we are perceptually blind 
to individual photons and cannot see how many of them strike the retinal receptors, or
determine their exact wavelength, despite the fact that photons are the primary inputs
that activate the visual brain. In contrast, we do perceive objects of all kinds, even though
they are seen for the first time, are partially hidden by other objects, or are seen under 
different illumination levels and altered by shadows. Thus, despite huge modifications of 
the photon flux, we are able to perceive and recognise the same objects. 



THE GESTALTIST APPROACH TO INVARIANT 

Early in the 20th century, Wertheimer (1912) analysed in detail the theoretical consequences
of the fact that two static flashes presented in succession with appropriate temporal
and spatial separations elicit a perception of visual apparent motion. This observation
was taken as a strong indication that the visual system does more than simply register
external events, suggesting that the “whole is more than the sum of its parts”.
From this seminal analysis of apparent motion emerged a scientific attempt to define the
general principles underlying that claim, known as Gestalt psychology.

The Gestalt school, under the impetus of Wertheimer (1912), Koffka (1935) and
Köhler (1929), and later in the century Metzger (1934), Kanizsa (1976) and a number
of others, developed experimental paradigms to define and isolate the general rules
underlying perceptual organisation. Using simple visual or auditory stimuli, some principles
involved in perceptual organisation could be qualitatively assessed. In vision,
figure/ground segregation and perceptual grouping of individual tokens into a “whole”
appeared to rely strongly on several rules such as good continuity, proximity, closure,
symmetry, common fate, synchronicity, etc. Most importantly, these principles define
spatial and temporal relationships between “tokens”, whatever the exact nature of
these tokens: dots, segments, colour, contours, motion, etc. Implicitly, the general 
model underlying the Gestaltist approach is a geometrical one, stressing the spatial
relations between parts rather than the intrinsic processing of the parts
themselves. However, the attempt of the Gestalt school to offer a plausible neuronal
perspective that could explain their observations on perceptual organisation failed, as
the Gestaltists were thinking in terms of an isomorphism between external and internal
geometrical rules whereby spatial relationships between neurones would mimic the
geometry of the stimulus. Electrophysiological and anatomical studies did not
reveal such an isomorphism. Yet, in this paper, we consider recent neurophysiological
findings that suggest how geometrical principles may be implemented in the brain and 
discuss hypotheses about the physiological mechanisms that may underlie perceptual 
grouping. 



VISION THROUGH SPATIAL FREQUENCY FILTERS 

In contrast with the Gestaltist program of research, and following the psychophysical 
approach defined by Fechner at the end of the 19 lh century, another view of how the vi- 
sual system processes its inputs developed. The goal of this approach was to establish 
the quantitative relationships between the physical stimulus inputs and the perceptual 
outputs of sensory systems, in order to define and measure the elementary sensations 
processed by the human brain. In this atomist approach, less emphasis is put on percep- 
tual organisation and grouping and more on the early processing performed by the cen- 
tral nervous system, using threshold measurements as a tool to probe sensory systems. 
This approach requires defining a model of the stimulus, a model of the perceptual and
neural processes at work and a model of the decisional processes needed to generate an 
observer’s response. Some powerful laws of perception could be demonstrated in this 
methodological framework, among which the Bouguer-Weber law on discrimination
thresholds or the so-called Fechner law according to which sensation grows as a
function of the logarithm of the intensity of the physical inputs. This approach soon
benefited from the tremendous progress in electrophysiological techniques in the middle of
the 20th century that allowed the recording of the responses of cortical cells to
visual stimulation. The discovery that retinal ganglion cells respond optimally to a re- 
stricted portion of the visual space, and are selectively activated by localised distribu- 
tions of luminance or chromatic contrast gave birth to the notion of a spatially limited 
receptive field that processes locally specific characteristics of the visual inputs
(Hartline, 1940; Hubel & Wiesel, 1968). Several classes of ganglion cells with different
functional and morphological properties were identified, among them the magnocellular
and parvocellular cells, which project through parallel pathways to the lateral geniculate
nucleus and from there to primary visual cortex and higher areas. These new findings led
psychophysicists to look for the perceptual consequences of the existence of these cells, as
the spatio-temporal structure of their receptive fields provided strong experimental
evidence for determining a model of the stimulus relevant to biological vision. In their
seminal study, Campbell & Robson (1968) proposed that the centre-surround receptive field
structure is well suited to process the incoming retinal inputs in a way analogous to that 
proposed by the French mathematician Fourier to analyse periodic events. Fourier’s the- 
orem states that any complex event can be mathematically decomposed into a simple 
sum of sinusoidal waves of different frequency, amplitude and phase. According to 
Campbell & Robson, the ganglion cells in the retina would decompose visual inputs in- 
to a spectrum of spatial frequency bands, each being selectively processed by a sub-pop- 
ulation of neurones. The idea that the visual system transforms a spatial distribution of 
light into a set of different spatial frequency bands had a huge impact on subsequent re- 
search. Since then, the model of the stimulus considered as relevant to study the visual 
system was based on the linear Fourier decomposition. Consequently, the most elemen- 
tary stimulus, represented by a single point in the frequency domain, is a sinusoidal dis- 
tribution of luminance contrast, consisting of a simple extended oriented grating. One of 
the great advantages of this model was that the Fourier transform is a linear operation,
and therefore appeared as a powerful tool to determine whether the visual system itself 
behaves as a linear spatial frequency analyser. 
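
As a rough illustration of why a centre-surround receptive field can act as a band-pass spatial frequency filter, the sketch below models the receptive field as a one-dimensional difference of Gaussians and measures its response to cosine gratings of different spatial frequencies. The difference-of-Gaussians form, the widths `sigma_centre` and `sigma_surround`, and the test frequencies are assumptions chosen for illustration; they are not taken from Campbell & Robson (1968).

```python
import math


def gaussian(sigma, half_width):
    """Sampled Gaussian profile, normalised so that its weights sum to 1."""
    weights = [math.exp(-x * x / (2 * sigma * sigma))
               for x in range(-half_width, half_width + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def dog_kernel(sigma_centre=1.0, sigma_surround=3.0, half_width=15):
    """Narrow excitatory centre minus broad inhibitory surround (hypothetical widths)."""
    centre = gaussian(sigma_centre, half_width)
    surround = gaussian(sigma_surround, half_width)
    return [c - s for c, s in zip(centre, surround)]


def grating_response(kernel, frequency):
    """Response to a cosine grating (cycles per sample) centred on the receptive field."""
    half_width = len(kernel) // 2
    positions = range(-half_width, half_width + 1)
    return sum(w * math.cos(2 * math.pi * frequency * x)
               for w, x in zip(kernel, positions))


kernel = dog_kernel()
# Uniform field, intermediate grating, fine grating.
responses = {f: grating_response(kernel, f) for f in (0.0, 0.1, 0.4)}
for frequency, value in responses.items():
    print(f"{frequency:.1f} cycles/sample -> response {value:.3f}")
```

Under these assumptions the response is negligible for a uniform field, largest for the intermediate grating and weak for the fine grating, which is the band-pass behaviour that a set of such units with different sizes would need in order to cover different spatial frequency bands.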

A Fourier decomposition of a two-dimensional (2D) image results in both an energy
spectrum, describing the distribution of amplitude in different spatial frequency bands at 
different orientations, and a phase spectrum, that contains information about the absolute 
phase of different spatial frequencies and orientations, and that represents the position of 
different spatial frequencies relative to an arbitrary origin (figure 1). 
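
The decomposition into an amplitude (energy) spectrum and a phase spectrum can be sketched with a naive two-dimensional discrete Fourier transform. The code below is a minimal illustration written with the Python standard library only; the 8 x 8 image size and the vertical cosine grating are assumptions made for the example. It shows that a sinusoidal grating occupies a single point in the frequency domain, which for a real-valued image appears together with its mirror-symmetric counterpart.

```python
import cmath
import math


def dft2(image):
    """Naive 2D discrete Fourier transform of a list-of-lists image."""
    rows, cols = len(image), len(image[0])
    coeffs = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0j
            for y in range(rows):
                for x in range(cols):
                    angle = -2 * math.pi * (u * y / rows + v * x / cols)
                    total += image[y][x] * cmath.exp(1j * angle)
            coeffs[u][v] = total
    return coeffs


def spectra(image):
    """Return the amplitude spectrum and the phase spectrum of an image."""
    coeffs = dft2(image)
    amplitude = [[abs(c) for c in row] for row in coeffs]
    phase = [[cmath.phase(c) for c in row] for row in coeffs]
    return amplitude, phase


N = 8
# Vertical grating: luminance varies along x only, 2 cycles across the image.
grating = [[math.cos(2 * math.pi * 2 * x / N) for x in range(N)] for _ in range(N)]
amplitude, phase = spectra(grating)

# Frequency-domain locations (u, v) that carry non-negligible amplitude.
peaks = sorted((u, v) for u in range(N) for v in range(N) if amplitude[u][v] > 1e-6)
print(peaks)
```

The phase values returned by `spectra` are expressed relative to the image origin (pixel 0, 0), which is the arbitrary origin referred to in the text.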

In practice, both psychophysicists and electrophysiologists used extended gratings of 
different spatial frequencies, orientation and contrast to probe the visual system, as they 
were mostly concerned with the effects of the energy spectrum on contrast sensitivity 
and neuronal selectivity but were less concerned with the phase spectrum. However, 
simple cells in primary visual cortex were found to respond to the spatial phase, the po- 
sition of a grating within their receptive field, and psychophysical studies showed that 
observers rely heavily on the phase spectrum to recognise objects and scenes, and to a 
lesser extent on the energy spectrum. For instance, when blending, through image
synthesis, the energy spectrum of an image A with the phase spectrum of an image B, image
B is more easily recognised than image A (Piotrowski & Campbell, 1982). 
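
The blending operation can be sketched in one dimension. The example below is a toy illustration, not the procedure of Piotrowski & Campbell (1982): it combines the amplitude spectrum of a luminance profile A with the phase spectrum of a profile B and checks where the resulting hybrid has its peak. The profiles, their length and the helper names `dft`, `idft` and `hybrid` are assumptions made for the example. It says nothing about recognition as such; it only shows that the position of a feature follows the phase spectrum.

```python
import cmath
import math


def dft(signal):
    """Naive 1D discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(coeffs):
    """Naive inverse 1D discrete Fourier transform."""
    n = len(coeffs)
    return [sum(coeffs[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def hybrid(amplitude_source, phase_source):
    """Profile with the amplitude spectrum of one input and the phase spectrum of the other."""
    coeffs_a = dft(amplitude_source)
    coeffs_b = dft(phase_source)
    mixed = [abs(a) * cmath.exp(1j * cmath.phase(b)) for a, b in zip(coeffs_a, coeffs_b)]
    # The imaginary part is rounding noise because both inputs are real.
    return [value.real for value in idft(mixed)]


N = 16
profile_a = [1.0 if 2 <= x <= 4 else 0.0 for x in range(N)]   # bar at positions 2 to 4
profile_b = [1.0 if x == 10 else 0.0 for x in range(N)]       # thin line at position 10

mix = hybrid(profile_a, profile_b)
peak_position = max(range(N), key=lambda x: mix[x])
print(peak_position)
```

The hybrid peaks at the position of the feature in profile B, although its amplitude spectrum is entirely that of profile A.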

One issue with the representation of images in the Fourier domain is that the position 
and geometrical relationships between different parts of an image, although they are 
somehow embedded in the phase spectrum, are difficult to visualise and analyse. This is 
mainly because the phase of each spatial frequency component is expressed relative to 
an arbitrary origin, but does not represent directly the relative phase between different 
spatial frequencies that would give information about their spatial relationships. Conse- 
quently, researchers have often discarded the phase spectrum, and thus the geometry, 
when analysing their stimuli. More recently, the development of multiscale analysis and
wavelet transforms that use filters well localised both in space and in the frequency
domain provided new tools to describe the morphological properties of images while
modelling more accurately the response profile of cortical cells. However, the extraction of
the geometrical properties of images with the wavelet transform remains a difficult
problem (Mallat, personal communication).

Figure 1. An image and its Fourier transform, showing the amplitude and orientation of the
frequency spectrum (left) and the phase spectrum (right).

The strong electrophysiological evidence suggesting that visual neurones with spatial- 
ly restricted receptive fields indeed respond selectively to spatial frequency led to the 
idea that the visual cortex consists of a set of spatial frequency analysers working
independently and in parallel (De Valois & De Valois, 1988). Accordingly, a complex input
image activates a large population of cells distributed across the cortical surface, each 
analysing a different region of the space and processing a specific spatial scale of the 
stimulus input. Given this distributed representation of an image in the visual cortex, im- 
portant issues arise: how does the brain aggregate the numerous neuronal responses to a 
complex object in order to segregate it from its background? How and under what con- 
ditions does the visual system bind together these distributed responses while avoiding 
spurious combinations? What are the rules and physiological mechanisms involved to 
account for this perceptual organisation? 

In the following, we very briefly describe the organisation of the visual brain and pres- 
ent some of the challenging issues that emerge. We then present and analyse experi- 
mental results related to this issue using a simple experimental paradigm that proved 
useful to uncover the role of geometrical constraints in perceptual grouping and to un- 
derstand how the brain builds up perceptual moving objects from their moving parts. Fi- 
nally, we discuss some physiological mechanisms that may account for the experimen- 
tal findings. 

The recent progress in our knowledge of both the anatomy and physiology of the visual 
brain indicates today that it consists of an elaborately interconnected network of over 30
visual areas with highly specialised functional roles (Van Essen & DeYoe, 1995). Their
organisation is generally thought of as a roughly hierarchical structure in which neural
computations become increasingly complex. According to this view, neurones at the lower
levels of the hierarchy would process elementary characteristics from relatively small regions
of visual space. The results of these analyses are then passed onto and processed by higher 
level units that integrate information across larger spatial and, possibly, temporal extents. 
Anatomical and physiological evidence supports this convergence scheme. For instance, the
responses of rods and cones to distributions of light intensities are combined both spatially
and temporally through convergent projections to build up the centre-surround receptive 
field of ganglion cells tuned to spatial frequency. Further combination of the outputs from 
lower level detectors would then explain the processing of orientation, colour, movement 
etc. (Hubel & Wiesel, 1968). According to this view, at each cortical level, detectors
respond to those features to which they are preferentially tuned, within a fixed location in
retinal space. Moreover, neurones tuned to different dimensions such as motion, colour or form
are located in distinct areas distributed along two parallel pathways, often viewed as the
expression of a “what and where” or “perception/action” dichotomy. This view stems from a
number of electrophysiological recordings of cells in areas distributed along these pathways and
from psychophysical and neurological studies, showing that motion processing and oculomotor
control are specific to the dorsal pathway, while colour and form analysis is mostly
performed in the ventral pathway.





JEAN LORENCEAU 



CONTOUR INTEGRATION AND LONG-RANGE HORIZONTAL CONNECTIONS 

In the primary visual cortex, the receptive fields of visual neurones are organised in a 
retinotopic fashion, such that neighbouring neurones analyse nearby regions of the visual
field. Neurones in a single column perpendicular to the cortical surface are selective to 
the same orientation while orientation selectivity changes smoothly from one column to 
the next resulting in hypercolumns where neurones are processing neighbouring regions 
of the visual space. This pattern has exceptions, however: the cortical surface was found
to present singularities, called pinwheels, where neurones rapidly change their
orientation selectivity as well as their positions in visual space.

This organisation suggested that the brain processes its visual inputs in parallel through 
spatially limited receptive fields insensitive to remote influences. This view has recently
been challenged by physiological studies showing that the responses of V1 neurones
to oriented stimuli presented within their receptive field can be markedly modulated by 
stimuli falling in surrounding regions, which by themselves fail to activate the cell. 
These influences could be mediated by cortico-cortical feedback projections from higher
cortical areas as well as by long-range horizontal connections found within V1. Indeed,
these horizontal connections link regions over distances of up to 6-8 mm of cortex,
tend to connect cells with similar orientation preferences and more specifically cells
whose receptive fields are topographically aligned along an axis of collinearity. Thus, this
circuitry (feedback and long-range connections within a single area) provides a possible
physiological substratum to compute some of the geometrical properties of the incoming
image.

In support of a functional link between neurones through horizontal connections in 
primary visual cortex, a number of recent psychophysical studies uncovered strong 
contrast-dependent centre-surround interactions, either facilitatory or suppressive, that
occur when one or several oriented test stimuli are analysed in the presence of sur- 
rounding oriented stimuli. For instance, contrast sensitivity is improved by similar 
flankers, collinear and aligned with the test stimulus. Changing the relative distance, 
orientation, spatial frequency or contrast of the flankers modulates the change in sen- 
sitivity, allowing the analysis of the architecture of these spatial interactions (Polat & 
Sagi, 1994). In addition, the ability to detect the presence of a specific configuration 
of oriented bars immersed within a surrounding texture of randomly oriented elements
with similar characteristics is better for configurations of collinear and aligned
elements than for parallel configurations. Field & collaborators (1993) proposed that 
perceptual association fields are involved in this contour integration process, and sug- 
gested that the architecture of horizontal connections may underlie these effects. This 
notion of association field is supported by studies showing that these interactions are 
decreased or suppressed in amblyopic patients who suffer from a disorganisation of 
long-range connectivity. Overall, these studies are compatible with the view that long- 
range connections play a functional role in perceptual contour integration, and further 
suggest that they may constitute a physiological substrate that implements some of the
Gestalt rules at an early processing stage.



INFLUENCE OF SINGULARITIES IN MOTION INTEGRATION, SEGMENTATION AND SELECTION

It has long been known that object motion or self motion can elicit a perception of a
form (structure from motion) that would not be recognised if the retinal image was static,
as is the case with biological motion (Johansson, 1950) or rotating three-dimensional
clouds of dots. However, less is known about the influence of form on motion perception.

We now briefly describe the results of psychophysical experiments concerned with the 
influence of perceptual interactions between form and motion processing on motion in- 
tegration, segmentation and selection. We present evidence that motion grouping relies 
heavily on the processing of local singularities such as junctions and line-ends, and on 
more global properties of objects such as collinearity, closure and surface formation, i.e. 
geometrical properties of the stimulus. In addition, experimental evidence suggests that 
form/motion interactions do not result from a convergence in late visual areas of motion
and form information conveyed by the dorsal and ventral pathways, but already occur at
an early processing stage. 

An object’s motion is analysed in primary visual cortex by direction selective neurones 
with oriented receptive fields of limited spatial extent. It is easy to show both theoretically 
and experimentally that these neurones are unable to accurately signal the physical
direction of an oriented contour that crosses a neurone’s receptive field (Henry et al., 1972;
Fennema & Thompson, 1979). The inaccuracy of these motion selective neurones occurs
because they cannot process the motion component parallel to the contour, which by
itself does not produce any change in the input to the cell, and can therefore only respond
to the component perpendicular to the preferred cell orientation, a limitation known as the 
“aperture problem”. There are several ways to overcome this problem. One is to rely on 
the richer information available at contour extremities, where the existence of singulari- 
ties provides sufficient cues to solve the aperture problem in a small region of the visual 
field. Another possibility is to combine the ambiguous neuronal response elicited across 
space by moving contours with different orientations and to compute the physical direc- 
tion according to some combination rule (Adelson & Movshon, 1982; Wilson & Kim, 
1994). Note that the law of common fate proposed by Gestaltists, in which components 
moving in the same direction with the same speed are bound together and interpreted
as belonging to the same object (Koffka, 1935) cannot account for motion grouping, as
this simple rule is insufficient to constrain a unique interpretation of visual motion.
Indeed, the common fate principle implicitly assumes that visual neurones analyse 2D
motion whereas most cortical neurones signal only one-dimensional (1D) motion (see above:
the aperture problem). In addition, motion in a three-dimensional (3D) space projects on
a two-dimensional (2D) retinal space. Thus, identical retinal motion may correspond to 
different trajectories or conversely movements in different directions with different 
speeds may correspond to a unique motion of a single object. 
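
The aperture problem and one possible combination rule can be made concrete with a short sketch. The first function computes the only quantity a local, orientation-tuned unit is assumed to measure here, namely the signed speed perpendicular to the contour. The second recovers the 2D velocity from two such measurements by solving the two constraint equations, a rule usually called intersection of constraints in the literature following Adelson & Movshon (1982). The function names, the contour orientations and the example velocity are assumptions made for illustration.

```python
import math


def normal_speed(velocity, contour_angle):
    """Unit normal of a contour (orientation in radians) and the signed speed along it."""
    normal = (-math.sin(contour_angle), math.cos(contour_angle))
    speed = velocity[0] * normal[0] + velocity[1] * normal[1]
    return normal, speed


def intersection_of_constraints(normal_a, speed_a, normal_b, speed_b):
    """Solve normal_a . v = speed_a and normal_b . v = speed_b for the 2D velocity v."""
    det = normal_a[0] * normal_b[1] - normal_a[1] * normal_b[0]
    if abs(det) < 1e-12:
        raise ValueError("parallel contours leave the velocity ambiguous")
    vx = (speed_a * normal_b[1] - speed_b * normal_a[1]) / det
    vy = (normal_a[0] * speed_b - normal_b[0] * speed_a) / det
    return vx, vy


true_velocity = (1.0, 0.5)
angle_a, angle_b = math.radians(45), math.radians(135)

# Aperture problem: sliding along the contour leaves the measurement unchanged.
tangent_a = (math.cos(angle_a), math.sin(angle_a))
slid_velocity = (true_velocity[0] + 3.0 * tangent_a[0],
                 true_velocity[1] + 3.0 * tangent_a[1])
_, speed_true = normal_speed(true_velocity, angle_a)
_, speed_slid = normal_speed(slid_velocity, angle_a)

# Combining two differently oriented contours removes the ambiguity.
normal_a, speed_a = normal_speed(true_velocity, angle_a)
normal_b, speed_b = normal_speed(true_velocity, angle_b)
recovered = intersection_of_constraints(normal_a, speed_a, normal_b, speed_b)
print(recovered)
```

The sketch deliberately ignores line-ends and junctions, which the text identifies as the other source of unambiguous motion information.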

To study how local motion signals are integrated into a global object’s motion and to de- 
termine the contribution of form information in solving the aperture problem, Lorenceau 
& Shiffrar (1992) designed a simple paradigm in which simple geometrical shapes were
presented behind rectangular windows while moving smoothly along a circular trajectory
(see figure 2a).

Figure 2. A: Stimulus used in the experiments: an outlined diamond is seen behind windows that conceal the
vertices at all times. Moving the diamond along a circular path in the plane (large arrow in the centre) results in
a vertical translation of component segments within each window (small arrows). Since each segment does not
provide rotational information, integration across space and time is necessary to recover the object’s motion. B:
Percentage of the trials where observers successfully recovered the clockwise or anticlockwise direction of
motion, as a function of the different conditions tested (see text for details).



Under these conditions, a single contour segment visible in each window does not pro- 
vide enough information to determine the global direction of object motion: each seg- 
ment appears to move back and forth within each window with no rotational component. 
In order to recover the object’s global circular motion it is necessary to group and
combine the different segment motions. This stimulus thus offers a simple tool to test
whether human observers can combine segment motions across space and time, in
situations where singularities such as vertices do not provide directly relevant
information. Altering this occluded stimulus by changing the geometry between its constituent
segments or by altering the information provided by occlusion points at the border
between the window and the moving form offers a way to assess the role of these features
in perceptual grouping.
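
The logic of the paradigm can be sketched numerically. In the toy model below, a shape translates along a unit circle at unit angular speed, and each visible segment is assumed to signal only its speed perpendicular to its own orientation. These simplifications, the segment orientations of 45 and 135 degrees and all function names are assumptions made for the example, and signals from line-ends are ignored. Under these assumptions, the signal from a single segment is the same oscillation for both directions of rotation, merely shifted in time, whereas combining two segments recovers the direction.

```python
import math


def path_velocity(t, sense):
    """Velocity on a unit circular path; sense is +1 (anticlockwise) or -1 (clockwise)."""
    return (-math.sin(t), sense * math.cos(t))


def normal_speed(velocity, contour_angle):
    """Signed speed perpendicular to a segment of the given orientation."""
    nx, ny = -math.sin(contour_angle), math.cos(contour_angle)
    return velocity[0] * nx + velocity[1] * ny


def combine(speed_a, angle_a, speed_b, angle_b):
    """Recover the 2D velocity from the normal speeds of two segments."""
    ax, ay = -math.sin(angle_a), math.cos(angle_a)
    bx, by = -math.sin(angle_b), math.cos(angle_b)
    det = ax * by - ay * bx
    return ((speed_a * by - speed_b * ay) / det, (ax * speed_b - bx * speed_a) / det)


def recovered_sense(sense, steps=40):
    """Direction of rotation estimated from two segments over one full cycle."""
    angle_a, angle_b = math.pi / 4, 3 * math.pi / 4
    velocities = []
    for i in range(steps + 1):
        v = path_velocity(2 * math.pi * i / steps, sense)
        velocities.append(combine(normal_speed(v, angle_a), angle_a,
                                  normal_speed(v, angle_b), angle_b))
    # The recovered velocity vector turns in the same sense as the path.
    turning = sum(v0[0] * v1[1] - v0[1] * v1[0]
                  for v0, v1 in zip(velocities, velocities[1:]))
    return 1 if turning > 0 else -1


STEPS = 40
times = [2 * math.pi * i / STEPS for i in range(STEPS)]
single_acw = [normal_speed(path_velocity(t, +1), math.pi / 4) for t in times]
single_cw = [normal_speed(path_velocity(t, -1), math.pi / 4) for t in times]

# One segment alone: the clockwise signal is the anticlockwise one delayed by a quarter cycle.
shift = STEPS // 4
same_waveform = all(abs(single_cw[i] - single_acw[(i - shift) % STEPS]) < 1e-9
                    for i in range(STEPS))
print(same_waveform, recovered_sense(+1), recovered_sense(-1))
```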

When looking at a revolving diamond under these “aperture viewing” conditions, a 
striking observation is that whenever the apertures are visible, either because they 
have a different luminance from the background or because they are outlined, the di- 
amond appears rigid and moving coherently as a whole along a circular path whose di- 
rection can easily be determined. Decreasing the contrast between the apertures and 
the background decreases the perceived coherence and observers have trouble dis- 
criminating the diamond’s direction. When the apertures and the background have the 
same hue and luminance, observers report seeing a jumbled mess of four moving seg- 
ments. Clear perceptual transitions between a moving whole and its moving parts can 
be induced by this contrast manipulation, although at a given contrast level, such per- 
ceptual transitions also occur spontaneously over time. To get insights into this phe- 
nomenon and test different potential explanations, we modified the salience of line 
ends, either by using jagged apertures, that alter the salience of line-ends due to sym- 
metrical and rapid changes in contour length during the motion, or by changing the lu- 
minance distribution along the contour (i.e. high luminance at the centre and low lu- 
minance at the line ends or the reverse). As a general rule, we found that motion co- 
herence and discrimination performance improve as terminator salience decreases. 
Similar improvement in performance is observed when the overall contrast of the seg- 
ments decreases, suggesting the existence of a threshold above which singularities are 
resolved and the diamond segmented into parts. These observations show that
singularities (junctions, end points, vertices) exert a strong control on perceptual
integration of component motion over space and time, and are used to segment objects into
parts.

Eccentric viewing conditions produce dramatically different results. Whatever the 
aperture visibility, the diamond always appears as a rigid object whose direction is ef- 
fortlessly seen. This effect is not easily explained by an increase of receptive field sizes 
with eccentricity, since we found that reducing the size of the stimuli has little influence 
on perceived coherence in central vision. Rather, the effect of eccentric viewing condi- 
tions could reflect the relative inability of peripheral vision to resolve local discontinu- 
ities. A summary of these different results is presented in figure 2b. Performance in a 
forced choice direction discrimination task is plotted as a function of the different con- 
ditions tested. 

At this point several hypotheses that could be invoked to account for these phenomena 
can be discarded. For example, the idea that motion integration is facilitated with visible 
as compared to invisible apertures because the former, but not the latter, provide a
static frame of reference cannot explain why low contrast stimuli are perceptually coherent
in central vision when the windows are invisible. Also, neither the idea that human
observers use a constraint of rigidity to recover object motion, nor a significant role of
attention in binding is supported by our results: prior knowledge that a rigid diamond is
moving does not help to determine its direction and attentional efforts to glue the
otherwise incoherent segments into a whole coherent perception are useless. A
complementary demonstration with a stereoscopic diamond stimulus supports the view that early
parsing of the image relies on 2D discontinuities and depth ordering: if a high contrast
diamond has a positive disparity relative to the apertures and thus appears in front, its
motion appears incoherent, whereas negative disparities, inducing a perception of a
diamond moving behind the apertures, produce a highly coherent perception of a rigidly
moving object. Thus, despite the fact that the monocular image is identical in both
conditions, the perceptual outcome is dramatically different. This effect brings additional
support to the hypothesis that changes in terminator classification and depth ordering
regulate the transitions from motion integration to motion segmentation.



GLOBAL FORM CONSTRAINTS IN MOTION INTEGRATION 

Although the experiments described above indicate that the processing of singularities 
provides strong constraints on motion grouping, they do not directly address the role of
more global geometrical properties in motion integration. To answer this question, it is
necessary to modify the spatial relationships between the constituents of a shape, without
modifying the singularities or the total energy spectrum of the stimulus. In this way one
can ascertain that the potential effects of form on motion integration are not caused by
differences in the processing of end-points or in the distribution of Fourier energy in
different spatial frequency bands. One way to do this is to permute the positions of the
object’s parts without modifying the apertures or the segment characteristics. This was
done in a series of experiments using outlines of a variety of simple geometrical shapes, 
shown in figure 3a, such as a diamond, a cross or a chevron, etc. Eight different shapes 
were used, all constructed with the same component segments, but with different spatial 
distributions. Note that the energy spectrum of these different shapes is highly similar, 
the only important differences lying in the phase spectrum. 

Thus, any difference in the ability to recover the coherent global motion of these
different occluded shapes should be due to differences between their phase spectra. We then
asked observers to indicate the clockwise versus anticlockwise motion of these shapes
when seen behind windows that occlude their vertices. Surprisingly, the performance of
human observers in the global motion discrimination task strongly depends on which
shape is shown (figure 3b). As a general rule, “closed” figures made of relatable
segments (see Kellman & Shipley, 1991), for instance the diamond, yield much better
performance than “open” figures, such as a cross or a chevron, for which observers hardly
recover the global direction of motion. In addition, observers report the closed figures as
being highly coherent shapes moving as a whole, whereas open shapes appear as non-rigid
sets of line segments moving incoherently.

These results strengthen the view that contour and motion binding depends mainly on
the phase spectrum of these stimuli but little on their energy spectrum. What is it about
the ‘difficult’ shapes that makes motion integration so troublesome? To test whether this
difficulty results from a lack of familiarity with the occluded stimuli, we conducted
additional experiments where observers practised the task during a number of sessions or
were shown fully visible static shapes for one second before each test trial. Results
show that training and knowledge of the shapes in advance do not facilitate global
rotation discrimination for difficult shapes (Lorenceau & Alais, 2001). Since performance
seems immune to the influence of cognitive strategies arising from prior knowledge of
the stimulus, these findings suggest that the origin of the limiting factor rendering motion
integration more difficult for these shapes lies at a low level in the visual system.

Figure 3. A: Stimuli used to uncover the role of form in motion integration. Different shapes made up of
identical segments with different spatial distributions are used. Note that the energy spectra of these stimuli are
highly similar whereas the phase spectra differ. B: Percentage of the trials where observers successfully
recovered the clockwise or anticlockwise direction of motion, as a function of the different shapes tested. The results
show that different shapes with identical segment motions yield different performance.

It was noted above that form and motion are processed in parallel within the parvocellular
and magnocellular pathways. In addition to their different specialisation for form
and motion, these pathways also respond differentially to several stimulus variables,
particularly to contrast and speed. By varying these parameters it is thus possible to alter the
relative contributions of the two pathways to visual processing, and assess their
respective contributions. In particular, the poor sensitivity of the ventral “form” pathway to
luminance contrast and speed makes it possible to create stimuli which would favour the dorsal
“motion” path at the expense of form processing. This was done by reducing the
luminance of the contours which define the stimuli and by doubling the speed of rotation.
Given that the difference in global motion discrimination between the shapes seems sim- 
ply to be a matter of geometrical form, we expected that this would reduce the difference 
in performance between ‘easy’ and ‘difficult’ shapes. Reducing contour luminance re- 
sulted in a dramatic improvement in performance on the global rotation task for the dif- 
ficult shapes, with good performance for the cross and chevron. Importantly, speed also 
interacted with stimulus shape, such that performance for the cross and chevron was no- 
ticeably better at the higher speed. Thus, not only does overall performance improve as 
stimulus conditions increasingly favour the dorsal “motion” pathway, but the distinction 
previously seen between easy and difficult shapes is progressively attenuated. 

These findings show that reducing the contribution of the form pathway reduces the dif- 
ferences between easy and difficult shapes. More specifically, it is performance for the dif- 
ficult shapes which improves most, rising toward the near-perfect level of the diamond 
shape. This confirms that the difference between the shapes really is simply a matter of 
geometrical form, since, once the influence of form information is reduced, all of the 
shapes are essentially identical in terms of their spatiotemporal content and thus produce 
the same global motion solution. This points to a strong interaction between form and mo- 
tion processing, whereby the form processing path can exert a strong suppressive influence 
on the motion pathway, determining whether or not local motions are integrated into co- 
herent global motions. This influence of form on motion could result from late interactions 
between the “form” and “motion” pathways, as neuronal selectivity to shapes such as dia- 
monds, crosses and chevrons is found in the ventral pathway, whereas neurones detecting 
the kind of rotary motions used here are found in the dorsal pathway. However, several as- 
pects of the present data, such as the strong contrast dependence, the absence of priming 
effects, and the lack of learning for difficult shapes, suggest an early form/motion
interaction. In addition, we propose that there is something about the difficult shapes which
actually impedes motion integration. Our data suggest that the role of form information is to
regulate whether motion integration should go ahead or not: contours forming convex, 
closed forms (good gestalts) would favour motion integration, while open, convex forms 
would trigger a veto from the form system which would prevent motion integration. Giv- 
ing the form pathway a right of veto over motion integration would help prevent the mo- 
tion system from integrating local motions which do not belong together, which is espe- 
cially a risk when they are spatially separated or partially occluded. Integration in occlud- 
ed regions must be sensibly constrained to prevent spurious motion integration, and a 
form-based veto could do this. A decision to veto motion integration would need to be 
made early and would explain the observation that performance with ‘difficult’ shapes
remained poor despite extended practice or priming with complete shapes. If motion
integration were vetoed early for ‘difficult’ shapes, no learning could take place in
higher-level form areas. This suggests that the influence of form on motion could already take place
between the magno and parvo streams that are known to interact as early as V1.



FORM AND MOTION SELECTION 

In the experiments presented so far, the perception of a moving “whole” is contrasted
with the perception of its parts. Although this design sheds light on the processes
involved in the integration and segmentation of component motions, it is not well
suited to address the problem of selection, by which the visual system must decide which
local motions should be bound with others, and when. Consider instead the stimulus shown
in figure 4a, which consists of two overlapping shapes moving in different directions. If
these stimuli are seen behind small apertures, such that only straight segments are
visible, the activity elicited in orientation- and direction-selective cells in the primary
visual cortex (and this would also be true of any local motion sensor) might resemble the
pattern presented in figure 4b, in which a population of neurones faces the aperture
problem, i.e. each neurone responds only to the motion component orthogonal to its
preferred orientation.

Figure 4. Illustration of the problem of selection: two partially occluded overlapping figures moving in
different directions (right) elicit a response from a collection of neurones in primary visual cortex. The visual
system must select and combine the responses to one object while discarding the responses to the second object
and avoid spurious combinations so as to correctly recover the motion of objects in the scene.
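
The aperture problem mentioned above can be illustrated with a short sketch. It assumes an idealised local sensor that sees only a straight contour of known orientation; the function name `component_motion` and the example velocity are illustrative assumptions and are not taken from the experiments.

```python
import math


def component_motion(velocity, edge_orientation_deg):
    """Return the part of `velocity` that is measurable through a small aperture.

    Only the component orthogonal to the contour can be measured; motion
    along the contour leaves the image inside the aperture unchanged.
    """
    theta = math.radians(edge_orientation_deg)
    # Unit normal to a contour oriented at angle theta from the horizontal.
    nx, ny = -math.sin(theta), math.cos(theta)
    speed_along_normal = velocity[0] * nx + velocity[1] * ny
    return (speed_along_normal * nx, speed_along_normal * ny)


if __name__ == "__main__":
    # Hypothetical object translating rightwards at one unit per frame.
    true_velocity = (1.0, 0.0)
    for orientation in (0.0, 45.0, 90.0, 135.0):
        print(orientation, component_motion(true_velocity, orientation))
```

In this sketch a horizontal contour (orientation 0) moving horizontally yields no measurable motion at all, while oblique contours yield oblique component motions even though the object moves strictly to the right.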

The question, then, is not only to determine whether the neural responses to individual
component motions should be bound together or not, but also to determine which
responses should be bound together while avoiding spurious associations with the
responses elicited by the contours of a different object. This, in principle, could be done
by determining which component motions are mutually consistent. However, when
observers are shown this stimulus and asked how many objects are present or in what
direction they move, they have difficulty answering. The percept is that of a single
non-rigid flow moving roughly in the average component direction. This observation suggests
that observers cannot select component motions that belong to the same rigid object on
the sole basis of the mutual consistency of the component directions, so as to segment
these motions from the remaining inconsistent moving contours. Presumably, other
constraints, or prior assumptions, must be used to solve this binding problem. Amongst
them, the constraints related to form information appear to play a critical role. This
possibility was tested using two moving diamonds, partially visible behind windows that
concealed their vertices at all times. This two-diamond stimulus may help uncover the
constraints involved in motion selection as it is inherently ambiguous, so the different
perceptions it may elicit can reveal the characteristics of motion selection processes.
When this stimulus is static, it can be decomposed into a small diamond surrounded by
a large one or into two overlapping diamonds of the same size (Figure 5). Observers
spontaneously perceive two diamonds of different sizes, rather than two identical
diamonds. This is expected, because proximity, good continuation and closure,
known to be powerful cues for grouping static contours, favour this interpretation.

Figure 5. Stimuli used to uncover the role of form in motion selection. A: When static, this stimulus yields two
distinct perceptual organisations: two overlapping diamonds of the same size or a small diamond embedded in a
large one. Observers spontaneously choose the latter solution. B: When both diamonds are moving in opposite
directions, different motion combinations are possible: depending on which motion signals available in each
aperture are selected, one can see incoherent motion of individual segments (no integration), coherent motion in depth
or coherent translation in the plane. See text for details. The results indicate that observers favour the grouping of
segments forming closed figures, whatever the motion percept implied by this grouping strategy.
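
The notion of “mutually consistent” component motions discussed above can be made concrete with a least-squares sketch. It is offered only as an illustration of the principle, not as the procedure used in these experiments or by the visual system. It assumes that each aperture delivers a unit normal n_i and a normal speed s_i, and that a single rigid translation v must satisfy v · n_i = s_i in every aperture. All function names and numerical values are hypothetical.

```python
import math


def measure(velocity, edge_orientation_deg):
    """Simulate one aperture: return (unit normal, speed along that normal)."""
    theta = math.radians(edge_orientation_deg)
    normal = (-math.sin(theta), math.cos(theta))
    speed = velocity[0] * normal[0] + velocity[1] * normal[1]
    return normal, speed


def fit_translation(measurements):
    """Least-squares translation for a set of (normal, speed) measurements.

    Returns (velocity, rms_residual). A residual near zero means the
    measurements are consistent with a single rigid translation.
    """
    a = b = c = p = q = 0.0
    for (nx, ny), s in measurements:
        a += nx * nx
        b += nx * ny
        c += ny * ny
        p += nx * s
        q += ny * s
    det = a * c - b * b
    if abs(det) < 1e-12:
        raise ValueError("all contours are parallel: the velocity is not determined")
    # Solve the 2x2 normal equations [[a, b], [b, c]] v = [p, q].
    vx = (c * p - b * q) / det
    vy = (a * q - b * p) / det
    squared = [(vx * nx + vy * ny - s) ** 2 for (nx, ny), s in measurements]
    return (vx, vy), math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    one_object = [measure((1.0, 0.0), 45.0), measure((1.0, 0.0), 135.0)]
    two_objects = one_object + [measure((-1.0, 0.0), 45.0), measure((-1.0, 0.0), 135.0)]
    print(fit_translation(one_object))   # consistent set, residual close to zero
    print(fit_translation(two_objects))  # mixed set, clearly non-zero residual
```

Under these assumptions, segments belonging to one translating object give a residual close to zero, whereas pooling segments from two objects moving in opposite directions gives a fitted velocity near zero and a large residual. A consistency test of this kind is available in principle; the observations reported in the text suggest that observers do not rely on it alone.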

To test whether these rules are also used to drive motion selection, we designed moving
versions of this two-diamond stimulus. If two identical diamonds oscillate back and
forth in opposite directions behind windows (Figure 6a), several interpretations are
possible, depending on how the motion signals available through the apertures are
combined. The perception of a small expanding diamond surrounded by a contracting
diamond could emerge if component motions in the centre were bound together, as in the
static version of this stimulus, and segmented from the outer component motions that
would on their own yield the perception of a large contracting diamond. Alternatively,
the component motions of diamonds with identical size could be grouped by similarity,
yielding the perception of two overlapping diamonds translating in opposite directions
in the plane. Other possibilities (absence of grouping, selection by proximity within a
window) also exist and can elicit different interpretations. Simple experiments were
done to determine the dominant perceptual organisation of motion. Observers
were asked to report whether they saw two unequal diamonds moving in depth (expanding
and contracting) or two identical diamonds translating back and forth in the
fronto-parallel plane (to the left and to the right). The results (figure 6b) clearly show
that observers systematically report seeing a small and a large diamond moving in depth,
but rarely perceive two translating diamonds or other motion combinations. If one uses
instead a large diamond surrounding a small one, so as to display the same spatial
configuration of eight static segments, and then applies the same horizontal oscillation, the
perception of motion changes dramatically: observers no longer see motion in depth but
report seeing a small and a large diamond translating in opposite directions in the same
plane, although the perception of two diamonds of equal size moving in depth would be
an equally possible interpretation. Thus, the same sets of four segments were selected in
both configurations (same size or different diamond sizes), resulting in very different
perceptions of motion. This suggests that motion signals are not selected on the basis of
motion information alone and that observers are not biased toward a specific
interpretation, for instance because motion in depth would be more relevant for the organism.
One explanation is that aspects of static forms, such as collinearity, alignment and
closure, strongly determine which signals are selected to drive the motion
integration/segmentation process.

Figure 6. Results of a forced-choice experiment in which observers were required to indicate whether they
perceived two diamonds expanding or contracting over time, or two diamonds translating back and forth in the
fronto-parallel plane. Observers’ choice depends on which segments are grouped. The results indicate that
observers based their choice on the perceived form, as they always group the four central segments, yielding a
small and a large diamond. This spatial configuration is then either seen as expansion and contraction or as two
translations in opposite directions.

Altogether, these results powerfully demonstrate the critical role played by geometrical
information in global motion computation. Local singularities such as vertices, junctions
or line-ends appear to exert a strong control on the balance between integration
and segmentation, as salient contour terminators appear to be used to parse the image into
parts. Global geometrical image properties also appear to provide strong constraints
on the integration process, as integrating moving contours into a global motion is a simple
task for some configurations (diamonds) while it is difficult for others (crosses and
chevrons). The observation that extrapolation of the contours of these different shapes
also produces two distinct classes of stimuli provides insights to account for this
dichotomy. In the case of the diamond, contour extrapolation produces a closed, convex
shape, whereas open concave shapes are produced in the case of the cross or the chevron.
The closure inherent in the diamond’s form may provide a reason for its superiority over
the other shapes. Closure of the diamond by amodal completion (Kanizsa, 1979), together
with the filling-in of its interior that this may engender, would effectively serve the
segregation of the diamond from its background. Consequently, judging the diamond’s
direction of rotation would be much easier than for open shapes, which generate poorer
responses at the level of object representation. The available neural evidence suggests
that these processes of completion, filling-in, and figure/ground segregation are initiated
early in visual processing. Cells in V1 have been shown to respond to contours rendered
discontinuous by occlusion (Sugita, 1999), and V1 cells are also capable of responding
to filled-in areas and not just to their borders (Komatsu et al., 1996). Brain imaging
has also revealed figure/ground segregation as early as V1 (Skiera et al., 2000).
Moreover, neural correlates of the boundary-initiated surface formation process described
above have been observed in V1 (Lamme, 1995).

The effects of boundary completion, filling-in and figure/ground segregation can all be
considered broadly under the rubric of form processing. Our data suggest that the role of
form information is to regulate whether motion integration should go ahead or not.
It is likely that cooperative interactions between neighbouring contours observed in
area V1 (Gilbert, 1992; Kapadia et al., 1995) can provide the cortical substrate to explain
the influence of form on motion grouping, possibly by conducting a pre-shaping of ele- 
ments into proto-forms. Evidence from physiology and psychophysics points to low-lev- 
el mechanisms being extensively involved in contour completion, filling-in and fig- 
ure/ground segregation. 

A simple model that synthesises the present findings is shown in figure 7. Depending on
the salience of singularities and on their spatial configurations, contours would be
grouped through horizontal connections in primary visual cortex (V1). Area V2 would
further classify the depth relationships at occlusion points and provide modulating inputs
to the MT/MST complex in the dorsal pathway, which could in turn, depending on
the results of this initial processing, integrate or segment the selected motion signals.



[Figure 7 diagram. Recoverable labels: “Perceptual motion integration and segmentation”; “V1: horizontal connections; contour completion; filling-in; line-end salience”.]




Figure 7. Hypothetical model of form and motion binding. Long-range horizontal connections in V1 would link
relatable contour segments depending on the salience of their end-points and on the collinearity and alignment
between them. V2 would further process and classify singularities (vertex, junction, occlusion point) in the
image. MT/MST would compute a partial solution to the aperture problem, depending on the inputs from V1 and
V2 that would “tag” the signals selected for further motion analysis. Feed-back from MT could help maintain
a viable solution.



CONCLUSION 

The present paper summarized some of the numerous studies that converge to support
the idea that geometrical relationships between visual elements or “tokens”, as initially
stated by the Gestaltists, play a fundamental role in the perceptual organisation of form
and motion. Recent developments in neuroscience and the available anatomical and
physiological evidence suggest that the neuronal circuitry described in the primary
visual cortex possesses some of the properties needed to process the geometrical
characteristics of the retinal inputs. This is certainly not the whole story, however: many other
aspects of form and motion that are selectively processed in areas distributed along the
dorsal and ventral pathways may also play a role. In addition, attention and prior
knowledge could modulate perceptual grouping, although the present experiments failed to
demonstrate such an influence. Finally, the fact that motion can by itself provide sufficient
information to segregate and recognise the form of objects indicates that interactions
between form and motion are bi-directional. Future studies will no doubt shed light
on the intricate relationships between the processing of motion and form.

Prof. Jean Lorenceau 
UNIC-CNRS 
Avenue de la Terrasse 
91198 Gif-sur-Yvette, France



SCINTILLATIONS, EXTINCTIONS,
AND OTHER NEW VISUAL EFFECTS 



1. INFLUENCES 

After completing studies in mathematics and engineering, I turned to molecular biolo- 
gy, and became interested in the problem of how molecules recognize each other. In 
many molecular processes, there are crucial stages in which a molecule has to select a 
cognate partner, among many non-cognate ones. Sometimes errors are made, but on the 
whole, the selection procedures are remarkably accurate. For instance, the error-rate in
the reproduction of DNA can be as low as 2x10^-11 in some cells, per monomer incorporated.
In the 1970s there were a number of puzzling observations on the patterns of errors
made in mutant cells exhibiting either higher accuracy, or lower accuracy than standard
cells. The prevalent doctrine was that the errors were due to error-prone processes,
which operated in parallel with the normal error-free process. Inspired, in part, by
readings in psychology and psychoanalysis, I was inclined to consider errors as products
of the normal process, signatures which revealed its inner workings. I thus showed, by a
simple mathematical analysis, that the error-patterns could be interpreted in this way [1],
and developed a body of ideas on how accuracy could be achieved in molecular
processes [2]. In this field, my name is associated with the name of John Hopfield [3], a
physicist now famous in cognitive psychology for his contributions to neural network
theory [4].

While working in a molecular biology laboratory, I was reading books and articles on 
vision and the brain. Three books, by Karl von Frisch [5], Bela Julesz [6] and Richard 
Gregory [7] made a lasting impression on me. From von Frisch, I learnt not to take an 
experiment at face value: bees do not discriminate between red and black, yet they have 
color vision. They distinguish two whites, identical to our eyes, on the basis of their ul- 
traviolet content. (Later, I found that a similar result had been established, much earlier, 
by Lubbock on ants). From Julesz, I learnt that one could do experiments probing the in- 
ner workings of the brain, using carefully designed images. From Gregory, I learnt all 
about constancies in vision, and how much of a paradox stable vision was. I was also im- 
pressed by his style, as a scientific writer. 

However, the book which gave me a real opportunity to join the field was a more
academic one, a synthesis on visual illusions by Robinson [8]. This book contained an
exhaustive description of the known geometrical visual illusions, and a summary of most,
if not all, theories put forward to explain them. None of these theories satisfied me, and
I thought there would be room for a fresh attack, in the line of my work on accuracy in
molecular biology. Geometrical illusions, far from being the result of error-prone
processes in the brain, would, on the contrary, be the signature of intelligent procedures
to represent shape and spatial relationships. A map of a portion of the earth may look
distorted, yet it may have been constructed according to a rigorous mathematical algorithm.
The distortions come from the need to accommodate the constraints (in the case of the
geographic map, one has to represent a spherical surface on a planar one), and not from the
inadequacy of the mapping procedure. I therefore dwelt upon the geometrical
problems of vision, and worked both on geometrical visual illusions, and stereoscopic vision,
first theoretically [9, 10] then experimentally (e.g., [11, 12]).

Ultimately, I became a professional in the field, and attended the European Congress- 
es on Visual Perception (ECVP). At these congresses, there was a predominance of talks 
by scientists from English-speaking countries, and the Italians were often relegated to 
the posters (the situation has improved, since), but it was often there that I found the 
most creative visual stimuli, there that I took the measure of the Italian contributions to 
the field. There was a strong trend, among the ruling psychophysicists, to describe te- 
dious experiments, made on boring visual stimuli that involved just three points, or three 
line segments, or two gratings, or even worse, two Gabor patches. I considered that
vision has to deal with a 3d world, which is seized with mobile eyes attached to a mobile
head. To perform correctly, visual spatial analysis then requires inputs with some
minimal complexity. Seven or eight anchoring points seem to be a strict minimum for 3d
spatial analysis (see [9, 13]). With stimuli lacking complexity, the normal visual algorithms
may not work properly, and what one studies then is perhaps a “default” setting of the 
visual system. I thus became increasingly attentive to the astute visual stimuli designed 
by scientists from the Italian school (see, e.g., the collection of contributions in [14, 15]). 
I began to develop ties with several of them, and I realized how much this school owed 
to Gaetano Kanizsa. Last, but not least, I became familiar with Kanizsa’s work as a
whole [16, 17], and took the measure of the depth of his thinking. Above all I
appreciated his way of embodying his conceptions into striking visual examples.
Mathematicians, dealing with a conjecture, the proof of which appears beyond reach, occasionally
defeat it by the discovery of a single counterexample. Kanizsa had the art of
constructing, with mastery, the right counterexample to defeat the too simplistic explanations of
the phenomena he was interested in.

Rather than embarking on wordy discussions of, say, the interrelationship between
top-down and bottom-up streams in visual analysis, I will more modestly introduce,
with minimal comments, a few of my favourite images: images which would perhaps
have elicited inspired comments from Kanizsa.



2. TEXTURES, AND SUBJECTIVE CONTOURS 

At the beginning, I was greatly influenced by Julesz, and was an admirer of his cam- 
ouflaging textures involving random lattices of black and white squares, used in stereo- 
scopic stimuli. However, real-life scenes contain edges at all orientations, and I sought 
to design camouflaging textures rich in orientations. I thus produced “random-curve 
stereograms” [18]. There, a 3d surface is represented by a computer-generated random 
curve, or by a distorted lattice. In spite of the low density of lines on the surface, it is
perceived as essentially complete, and one can perhaps classify it as a kind of subjective
surface. 

Later, I tried to produce camouflaging textures manually, and some of my efforts are
reminiscent of Kanizsa’s biotessitures (reproduced in [19], part III). In Fig. 1, I show a
stereoscopic image of a mask, covered with a hand-made texture. Although the details 
of the shape are concealed in monocular vision, some information can be retrieved un- 
der the conditions of ‘monocular stereoscopy’. When symmetry is introduced, in the 
manual, quasi-random textures, visually-rich patterns build up, expanding outwards 
from the axis of symmetry (Fig. 2). 

In my view one of the most exciting developments in the domain of subjective contours
is the extension of the phenomenon to 3d surfaces, first in stereo vision [20, 21] then to
drawings involving both masking and interposition cues [22]. Kanizsa was not the first to
notice subjective contours, but he made penetrating analyses connecting this domain to the
domain of the perception of transparency. One point which intrigues me is why, in figures
in which black and white play symmetrical roles (e.g., Fig. 3), we call the white surface
“subjective” and the equivalent black surface “real”.



3. SUBTLE DIFFERENCES 

One way of settling a point, in visual perception, is to design a pair of images which
differ, in their construction, by a hardly noticeable feature, and yet produce strikingly
different effects. For instance, Kanizsa showed that the perception of subjective letters,
represented by their shadows, failed at first when the letters were not displayed in their
usual orientation, and was recovered once the anomaly was recognized. I have used the
strategy of the subtle difference to study the role of orientation disparity in stereo vision
[23].

Here, I show a pair of figures in which the Fraser spiral illusion works, or does not
work, depending on a subtle detail (Fig. 4). In my opinion, this pair of images
establishes a bridge between the Fraser illusion and the gestalt principle of segregation by
contrast (a corollary to the principle of association by grey-level proximity).



4. ALTERNATIVE 3D INTERPRETATIONS 

With minimal changes, the drawing of a flat figure can evoke a 3d object. Kanizsa 
showed interest in the nature of the low-level cues which contributed to global 3d inter- 
pretation. There are also figures which elicit both 2d and 3d interpretations. In Fig. 5 the 
two trapeziums are at first interpreted as flat shapes. After a certain time, a 3d interpre- 
tation develops, in which the trapeziums are not even planar! They appear like twisted 
ribbons [24]. Once the 3d interpretation is acquired, it is difficult to see the trapeziums 
as planar again. Incidentally, this may be taken as a (rare) counterexample to the
genericity principle [25]. For other examples of switches in 3d interpretation, see [26, 27].



5. CONTRAST EFFECTS

The domain of visual contrast effects is still producing novelties (see the review in [28]).
While contrast effects were not, for me, a central preoccupation, I was driven into the field
by my interest in geometrical questions. Take the well-known Hermann grid effect. It is
usually presented as a two-dimensional array of black squares, separated by horizontal and
vertical rectilinear alleys. Is this very peculiar geometry necessary to the illusion?
Orientation, at least, must be respected, for if a Hermann grid is rotated by 45 degrees, a
different contrast effect becomes manifest. Perhaps, then, the Hermann grid is providing cues
relative to the geometrical layout of the neurons that are performing, in some area of the 
brain, the local contrast calculations. I thus teamed with a geometrically-minded visual sci- 
entist, Kent Stevens, to investigate the geometrical requirements of the Hermann grid illu- 
sion. Hundreds of geometric variants of the grid were generated, including variants giving 
the scintillation effect [29, 30] (see Fig. 6). The original question was not settled, but out 
of the stack of variants, new visual phenomena emerged [31-33]. 
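
For readers who wish to experiment with geometric variants of their own, a basic Hermann grid can be generated in a few lines. This is a minimal sketch, not the procedure used to produce the variants discussed here: the function names and the parameters `n_squares`, `square` and `alley` are assumptions chosen for illustration.

```python
def hermann_grid(n_squares=5, square=8, alley=2):
    """Return a square raster as a list of rows: 0 = black square, 1 = white alley."""
    period = square + alley
    size = n_squares * period + alley

    def in_alley(i):
        # The first `alley` pixels of every period belong to an alley.
        return i % period < alley

    return [[1 if in_alley(r) or in_alley(c) else 0 for c in range(size)]
            for r in range(size)]


def render(grid):
    """Render the raster as text, '#' for black pixels and '.' for white ones."""
    return "\n".join("".join("." if px else "#" for px in row) for row in grid)


if __name__ == "__main__":
    print(render(hermann_grid(n_squares=3, square=4, alley=1)))
```

Variants can then be explored by changing the ratio of `square` to `alley`, or by replacing the rule in `in_alley` with another geometry.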

The most spectacular one is the extinction effect [31] (Fig. 7). There, only a few disks 
are seen at a time, in clusters which move with the fixation point. It is as though, outside
the fixation point, a feature which is above the spatial threshold for detection needs
also to be above some contrast threshold, with respect to background, in order to be 
brought to attention. 

Distorting the squares of the Hermann grid, one can observe an effect in which illuso- 
ry lines are seen to pulsate [32] (Fig. 8). The orientations of these lines are unusual. They 
correspond to the directions of knight’s moves on a chessboard. These lines go through 
both black and white regions (see Fig. 5 in [32]) and could be testimonies of a coopera- 
tion between neurons having aligned receptive fields of opposite contrasts. 
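
As a quick check on the geometry, the orientations of knight's-move directions on a square lattice can be computed directly. The sketch assumes an undistorted chessboard with unit squares; on a distorted grid the angles may differ somewhat. The function name is illustrative.

```python
import math


def knight_move_orientations():
    """Orientations, in degrees within [0, 180), of lines joining squares a knight's move apart."""
    moves = [(2, 1), (1, 2), (-1, 2), (-2, 1)]
    return [math.degrees(math.atan2(dy, dx)) % 180.0 for dx, dy in moves]


if __name__ == "__main__":
    for angle in knight_move_orientations():
        print(round(angle, 1))
```

On a regular lattice the four orientations are approximately 26.6, 63.4, 116.6 and 153.4 degrees from the horizontal, that is, none of the horizontal, vertical or 45-degree directions.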

In the display of Fig. 9, the diamond-oriented domains appear differently contrasted, de- 
pending on whether they contain near horizontal or near vertical stripes [33]. For some peo- 
ple, the domains with near vertical stripes appear highly contrasted, while those with near 
horizontal stripes appear faded. For other subjects, it is the opposite. Once the effect is
noticed, it can be detected in many variants. This type of pattern combines easily with many
other visual effects. For instance, when straight lines are superimposed on the patterns with
stripes at different orientations, striking Zöllner-type distortions are observed (Fig. 10).

After having produced theories, then psychophysical data, I find more and more satis- 
faction, as Kanizsa did, in elaborating striking images. Whereas, in his case, the images 
must have been the outcome of a completely rational line of thinking, in my case they 
came by surprise. They were, at least for Figs. 7 and 8, the unexpected reward of very
systematic work on variations in the geometry of the stimuli.

Prof. Jacques Ninio 
Ecole Normale Supérieure
24 rue Lhomond
75231, Paris, France



Figure 1. Camouflaged stereogram. A mask of the face of a monkey, covered with hand-made texture, was
photographed from two viewpoints to generate this stereogram. Use the central image with the left image for 
convergent viewing, or with the right image for parallel viewing. Some depth information may be retrieved by 
looking at a single image through a narrow tube - for example, with the hand against the eye, the fingers
folded to create a cylindrical aperture.



Figure 2. Visually-rich patterns obtained by introducing symmetry in a hand-made texture. Near the
symmetry axes, the appearance of the texture is lost in favor of the patterns. The texture is better appreciated on the
sides, or when rotating the figure by ninety degrees. 



Figure 3. Subjective contours. In both figures, one sees a regular ring over a background of spaced hexagons. 
One is tempted to say that the ring on the left is real, and the ring on the right is subjective. However -ignor- 
ing the frames- the left and right figures differ by a mere inversion of black and white. Note also that, in a 
sense, each ring must be “cut away” from lines of its own colour.



Figure 4. A subtle difference. The Fraser spiral illusion works well in the left image, but not in the right
image. The two images differ by a slight rotation of the concentric rings, making their borders real in the right
image, and subjective in the left image. The orientations of the black or white arcs of a ring, taken separately, 
are typical of a real spiral field. 




Figure 5. Twisted trapeziums. It is possible to see the trapeziums protruding in 3d as twisted ribbons, the
horizontal sides being at the front, and the vertical sides at the back.



Figure 6. Scintillation effect (adapted from Bergen [29]). Brilliant spots appear to flash at the crossings of the
dark alleys. 



Figure 7. Extinction effect. On lines 9, 11 and 13, containing large disks half-way from alley-crossings, all 
disks are seen, while many of the large disks situated at the crossings, on lines 2, 4 and 6, are seen erratically.



Figure 8. Flashing lines. Two sets of bright lines are seen pulsating at about 30° and 120° with respect to the
horizontal. 




Figure 9. Orientation-dependent contrast. To most observers, the grey-level range appears narrower, either in 
the domains with horizontal stripes, or the domains with vertical stripes. Domains of one kind appear well
contrasted, and domains of the other kind appear toned down, although the stripes in them are seen with normal
resolution.



Figure 10. Distorted diagonals. The distorted appearance of the diagonal lines is reduced or cancelled when
these lines are oriented horizontally or vertically. The effect is also observed with black or white diagonals.



COMMONALITIES BETWEEN VISUAL IMAGERY AND IMAGERY
IN OTHER MODALITIES: AN INVESTIGATION BY MEANS OF FMRI 



INTRODUCTION 

The attempt to obscure the differences between seeing and thinking by stressing their
similarities is not an epistemologically correct operation, because using principles
related to another domain (such as thinking) to explain vision may induce a pre-packaged
explanation. Instead, stressing differences may lead to the discovery of new rules
governing only one of the two processes under investigation (Kanizsa, 1991). We report
this provocative statement of Kanizsa while approaching our research on mental
imagery for two main reasons: 1) most psychological research on imagery
is devoted to visual imagery, implicitly assuming that imagery derived from other
sensory modalities will present characteristics similar to those of visual imagery;
2) many studies on visual imagery are devoted to assessing whether primary perceptual
circuits are also involved in imagery and, therefore, to assessing how similar seeing is
to imagining. In this study we accepted Kanizsa’s suggestion by trying to assess
differences between visual and other-senses imagery in order to detect their peculiarities
and the degree of their overlap.

Mental imagery has recently gained a renewed interest thanks to the advent of brain 
mapping of cognitive functioning by means of new non-invasive techniques (fMRI, 
functional Magnetic Resonance Imaging, and PET, Positron Emission Tomography). 
This new approach permits the recording and visualization of different parameters re- 
flecting brain activity, with a high temporal and spatial resolution. 

The neuroimaging approach to mental imagery was mainly focused on mapping brain 
correlates of well-established behavioral data in order to clarify the status (epiphenome- 
nal vs. autonomous) of the processes underlying mental imagery. In particular, classical 
experiments on mental manipulation and image generation have been replicated show- 
ing the involvement of several brain areas in mental imagery. 

The first question raised in this debate concerns the extent of the involvement of the
primary visual areas, if any, in visual imagery. Such involvement is supported by evidence
showing that patients with focal brain damage exhibit similar impairments in visual
perception and imagery (for a review see Farah, 1995), and by neuroimaging data showing
activation in the occipital lobe in various visual imagery tasks (Chen, Kato, Zhu, Ogawa,
Tank & Ugurbil, 1998; Kosslyn & Thompson, 2000; Klein, Paradis, Poline, Kosslyn &
Le Bihan, 2000). The hypothesis of the involvement of the primary visual areas in
imagery is based on the assumption that visual imagery is depictive in nature (Kosslyn,
1994) and should share the same neural substrate as visual perception (Kosslyn, Alpert,
Thompson, Maljkovic, Weise, Chabris, Hamilton, Rauch & Buonomano, 1993). This
idea rests on the hypothesis that mental imagery activates backward projections from
‘high-level’ to ‘low-level’ areas of the visual system (Kosslyn, Maljkovic, Hamilton,
Horwitz & Thompson, 1995) due to the retrieval of stored information in order to
reconstruct spatial patterns in topographically organized cortical areas.

However, other studies did not find any evidence of the involvement of primary visual
areas in visual imagery (De Volder, Toyama, Kimura, Kiyosawa, Nakano, Vanlierde,
Wanet-Defalque, Mishina, Oda, Ishiwata et al., 2001; Mellet, Tzourio Mazoyer, 
Bricogne, Mazoyer, Kosslyn & Denis, 2000; Cocude, Mellet & Denis, 1999; Mellet, 
Tzourio Mazoyer, Denis & Mazoyer, 1998; D’Esposito, Detre, Aguirre, Stallcup, Alsop, 
Tippet & Farah, 1997), or found selective impairment in either imagery or perception 
following focal brain damage (Bartolomeo, Bachoud-Levi, De Gelder, Denes, Dalla 
Barba, Brugieres & Degos, 1998). 

It should be noted that between these two groups of studies there are many differences 
in techniques, procedures and experimental tasks. 

Regarding other brain areas, the middle-inferior temporal region, especially on the left 
hemisphere, has been repeatedly found to be active in various image generation (D’Es- 
posito et al., 1997) and mental rotation (Iwaki, Ueno, Imada & Tonoike, 1999; Barnes, 
Howard, Senior, Brammer, Bullmore, Simmons, Woodruff & David, 2000; Jordan, 
Heinze, Lutz, Kanowski & Lanche, 2001) tasks. These data support the idea that
modality specific processes underlie mental imagery because activation in this area has been
found also in visual object recognition (Stewart, Meyer, Frith & Rothwell, 2001) and in 
tasks requiring the recovery of visual features (Thompson-Schill, Aguirre, D’Esposito & 
Farah, 1999). However, by reviewing previous data on imagery tasks, Wise, Howard,
Mummery, Fletcher, Leff, Büchel & Scott (2000) suggest that the core of this activation
should be situated in the hetero-modal associative temporal cortex. According to these 
authors, this area mediates access to the amodal/non-linguistic internal representations 
of word meanings, and this role would be more coherent with the results obtained by 
means of such a wide range of cognitive tasks. In this view, its role in mental imagery 
would be independent from modality specific representations. 

Activation of associative areas in the parietal lobe has also been found but its role in 
mental imagery is somewhat controversial. Some authors suggest that these regions con- 
tribute to the processing of spatial attributes of imaged objects (Diwadkar, Carpenter & 
Just, 2000), whereas others emphasize their role in the construction of a supra-modal representation
by binding together modality specific information (Lamm, Windischberger, Leodolter, 
Moser & Bauer, 2001; Richter, Somorjai, Summers, Jarmasz, Menon, Gati, Georgopou- 
los, Tegeler & Kim, 2000). Finally, Carey (1998) suggests that this area should be con- 
sidered a key component of a third visual stream (besides the ventral and the dorsal path- 
ways) having perceptual, attentional and motor-related functions. 

In the context of mental imagery, the role of the prefrontal cortex, known to be responsible
for working memory operations, has been somewhat neglected, perhaps due
to the heterogeneous pattern of activation emerging from different studies. As reported 
by several authors, spatial working memory tends to activate the right prefrontal cor- 
tex, whereas verbal tasks involve mainly the left or bilateral prefrontal cortex (Burbaud, 
Camus, Guehl, Bioulac, Caille & Allard, 2000; Bosch, Mecklinger & Friederici, 2001).

Studies examining the relationship between imagery and processes related to modalities
other than vision are very rare. Among them Zatorre, Halpern, Perry, Meyer & Evans
(1996), by comparing the PET data from auditory perception to those derived from au- 
ditory imagery, conclude that the same brain regions were activated in the two tasks. 
Hollinger, Beisteiner, Lang, Lindinger & Berthoz (1999) compared slow potentials ac- 
companying the execution of movements to the response accompanying their imagina- 
tion, and found again that similar regions were at work in both cases. 

In summary, although data support the idea that perception and imagery share a com- 
mon neural substrate, findings are not univocal, suggesting that this common substrate 
does not involve early perceptual stages. 

Neuroimaging techniques offer the possibility to investigate another interesting aspect 
of mental imagery, i.e. the distinctive features of intermodal mental imagery. 

From a behavioral point of view, interest in this topic is long-standing, as attested by
the construction of quantitative instruments aimed at evaluating mental imagery, not on- 
ly in the visual modality, but also in the auditory, haptic, kinesthetic, gustatory, olfacto- 
ry and organic ones (Betts, 1909; Sheehan, 1967; White, Ashton & Brown, 1977). Some 
of these studies investigate the relationships between visual and auditory imagery (Gis- 
surarson, 1992), visual and kinesthetic imagery (Farthing, Venturino & Brown, 1983), 
and visual imagery and olfactory stimulation (Wolpin & Weinstein, 1983; Gilbert, 
Crouch & Kemp, 1998). Other studies are concerned with the reported vividness of ex- 
perienced imagery (see for example Campos & Perez, 1988; Isaac, Marks & Russell, 
1986). Overall, these studies contributed to the imagery debate by legitimizing and en- 
couraging further investigations in this field. 

From a neurophysiological perspective, an increasing number of researchers have recently
adopted different psycho-physiological and neuroimaging techniques in order to
investigate intermodal connections (see for example Fallgatter, Mueller & Strik, 1997; 
Farah, Weisberg, Monheit & Peronnet, 1990; Del Gratta, Di Matteo, De Nicola, Ferretti,
Tartaro, Bonomo, Romani & Olivetti Belardinelli, 2001; De Volder et al., 2001).

However, until now little is known about how we imagine either an odor or the taste 
of our favorite dishes, or how we mentally reproduce the typical sound of everyday 
events. At least two specific aspects should be investigated both from a behavioral point 
of view, and from a neuro-physiological perspective; first, the specificity of mental im- 
agery linked to each sensory modality; second, the degree of overlap between visual 
imagery and other types of imageries. Both questions would allow us to clarify the na- 
ture of mental imagery: the former by studying the imagery process on a more exten- 
sive set of perceptual-like objects, the latter by studying how much imagery according 
to various sensory modalities is tied to the processing of visual features. 

The present study is devoted to the second question by trying to identify the common 
substrate of visual images and images generated according to other sensory modalities. 
It consists of an fMRI block-design experiment in which participants were requested to generate
mental images cued by short sentences describing different perceptual objects (shapes,
sounds, odors, flavors, self-perceived movements and internal sensations). Imagery
cues were presented visually and were contrasted with sentences describing abstract
concepts, since differences in activation during visual imagery and abstract thought
have often been assessed in the literature (Lehman, Kochi, Koenig, Koykkou, Michel & Strik,
1994; Goldenberg, Podreka, Steiner & Willmes, 1987; Petsche, Lacroix, Lindner,
Rappelsberger & Schmidt, 1992; Wise, Howard, Mummery, Fletcher, Leff, Büchel &
Scott, 2000).



EXPERIMENT 

METHODS 

PARTICIPANTS 

Fifteen healthy volunteers, after signing an informed consent waiver, participated in this 
study, which was approved by the local ethics committee. Eight of them were females and 
seven of them were males, and their age ranged between 19 and 20. All of them, as well as
their parents, were right-handed.



DESIGN 

The experimental task required subjects to generate mental images cued by visually 
presented stimuli. Each experimental session of a single subject consisted of three 
functional studies and a morphological MRI. In each functional study, stimuli belonging
to one experimental condition (regarding one of the seven sensory modalities)
were delivered, together with stimuli belonging to the control condition. The first ex- 
perimental condition always belonged to the visual modality, while those in the other 
two were evenly chosen among the remaining six modalities. Overall, the visual 
modality was studied fifteen times, while each of the other six modalities was studied five
times. The number of modalities studied for each subject was limited to three in order
to avoid lengthy recording sessions. The visual modality was always included and 
used as a reference. 
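
The allocation described above can be checked with a short sketch. The text says only that the two non-visual modalities were "evenly chosen"; the round-robin rule used here is an assumption made for illustration, and the function and variable names are hypothetical.

```python
from collections import Counter

# The six non-visual modalities named in the text.
OTHER_MODALITIES = ["auditory", "tactile", "kinesthetic",
                    "gustatory", "olfactory", "organic"]


def assign_modalities(n_subjects=15):
    """Return one list of three modalities per subject.

    Assumption: a simple round-robin rule. The study only states that
    the two non-visual modalities were evenly chosen.
    """
    plan = []
    for subject in range(n_subjects):
        first = OTHER_MODALITIES[(2 * subject) % 6]
        second = OTHER_MODALITIES[(2 * subject + 1) % 6]
        # The visual modality always comes first and serves as reference.
        plan.append(["visual", first, second])
    return plan


plan = assign_modalities()
counts = Counter(m for studies in plan for m in studies)
print(counts["visual"])                       # 15
print([counts[m] for m in OTHER_MODALITIES])  # [5, 5, 5, 5, 5, 5]
```

Any rule that gives each subject two distinct non-visual modalities and uses each modality five times would satisfy the counts reported in the text equally well.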

Functional studies were performed according to a block paradigm, in which 12 volumes 
acquired during mental imagery - i.e. during experimental stimulus delivery - were alter- 
nated three times with 12 volumes acquired during baseline - i.e. during control sentence 
delivery. Experimental and control stimuli were presented at the start of the first volume, 
and then at every fourth one, so that three different experimental stimuli or three different 
control stimuli, were presented in each block. Each stimulus, or control sentence, remained 
visible until it was replaced by the following. Thus, the subject could see every stimulus 
for the whole time interval corresponding to the acquisition of four volumes, i.e. 24 sec- 
onds. The duration of a block was therefore 72 seconds, and the total duration of a study 
was 7 minutes 12 seconds. Overall, nine different experimental stimuli and nine different 
control sentences were presented in each study. 
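
The durations quoted above follow from the block structure and the repetition time. The sketch below reproduces the arithmetic, taking the TR of 6 s from the APPARATUS section; the constant names are illustrative only.

```python
TR_SECONDS = 6            # repetition time given in the APPARATUS section
VOLUMES_PER_BLOCK = 12    # volumes in one imagery or one baseline block
VOLUMES_PER_STIMULUS = 4  # a new sentence appears at every fourth volume
BLOCK_PAIRS = 3           # imagery/baseline alternation repeated three times

stimulus_seconds = VOLUMES_PER_STIMULUS * TR_SECONDS
block_seconds = VOLUMES_PER_BLOCK * TR_SECONDS
stimuli_per_block = VOLUMES_PER_BLOCK // VOLUMES_PER_STIMULUS
stimuli_per_condition = stimuli_per_block * BLOCK_PAIRS
volumes_per_study = 2 * BLOCK_PAIRS * VOLUMES_PER_BLOCK
minutes, seconds = divmod(volumes_per_study * TR_SECONDS, 60)

print(stimulus_seconds)       # 24
print(block_seconds)          # 72
print(stimuli_per_condition)  # 9
print(volumes_per_study)      # 72
print(minutes, seconds)       # 7 12
```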

STIMULUS MATERIAL

Eight different sets of sentences referring to either concrete or abstract objects were used 
as mental image generation cues. Seven sets were used in the experimental condition and 
the remaining one was used in the control condition as a baseline. Each experimental set 
consisted of nine sentences, whereas the control set consisted of 27 sentences, and each 
sentence was composed of three or four words identifying either a definite perceptual
object or an abstract concept.

The experimental sets contained sentences referring respectively to the visual, auditory, 
tactile, kinesthetic, gustatory, olfactory, and organic modalities. The control set contained 
sentences referring to abstract concepts. The English translation of an example included in 
each set is given: Seeing a coin (visual), Hearing a rumble (auditory), Touching a soft ma- 
terial (tactile), The act of walking (kinesthetic), Smelling wet paint (olfactory), Tasting a 
salty food (gustatory), Feeling tired (organic), Admitting a misdeed (abstract). 

The entire set of stimuli was presented to the participants of the present experiment after
the end of the experimental session in order to obtain data on the effectiveness of the material.
Results revealed that participants classified 96% of visual items, 85% of auditory
items, 88% of tactile items, 74% of kinesthetic items, 90% of olfactory items, 90% of gustatory
items, 47% of organic items, and 55% of abstract items respectively as visual, auditory,
tactile, kinesthetic, olfactory, gustatory, organic, and none of the previous categories.
Chi-square comparison for each modality between observed and expected frequencies
revealed that participants’ responses match the item classification (p<0.001).
Moreover, the rating of the power to evoke mental images (on a scale range from 1 to 7) 
revealed that modality specific items obtained an average score of 5.30 (s.d. 0.67) while 
abstract items achieved an average value of 2.10 (s.d. 1.22) (t=11.98, p<0.0001). The result
was also confirmed for each single modality vs. abstract items comparison (p<0.001).

PROCEDURE 

Subjects were interviewed in order to verify the lack of contraindications to participating
in the experiment and were acquainted with the experimental apparatus. They were
then informed that they would be presented a set of sentences and were instructed to men- 
tally read these sentences, without moving their lips, to concentrate on them, and to try to 
imagine their content. 

Experimental and control sentences were projected on a translucent glass placed on the 
back of the scanner bore by means of an LCD projector and two perpendicular mirrors. 
An additional mirror fixed to the head coil inside the magnet bore allowed the subject to 
see the translucent glass. The LCD projector was driven by a PC placed at the scanner 
console and connected to it via a VGA cable through a hole in the shielded room. The PC 
was manually controlled by an operator, according to the volume acquisition timing. The 
stimuli and control sentences were administered by means of a slide presentation soft- 
ware, and were printed in yellow on a blue background. No artifacts due to the projector 
or the VGA cable were visible in the functional as well as in the morphological images.

APPARATUS

Functional MRI was performed with a SIEMENS VISION 1.5T scanner endowed 
with EPI (Echo Planar Imaging) capability. Each functional volume was acquired by 
means of an EPI FID (Free Induction Decay) sequence with the following parameters: 
30 bicommissural transaxial slices, 3 mm thickness; no gap; matrix 64 x 64; FOV (Field
Of View) 192 mm; 3 mm x 3 mm in-plane voxel size; flip angle 90°; TR 6 s; TE 60 ms.
Scan time for one volume was three seconds. The image volume covered the whole 
brain. 

In addition to functional images, a high resolution, morphological MRI was acquired at 
the end of each session, by means of a 3D-MPRAGE (Magnetization Prepared Rapid 
Gradient Echo) sequence. The parameters characterizing this acquisition were: 240 axial 
slices, 1 mm thickness, no gap, matrix 256 x 256, FOV 256 mm, in-plane voxel size 1 
mm x 1 mm, flip angle 12°, TR = 9.7 ms, TE = 4 ms. 
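
As a consistency check on the acquisition parameters, the in-plane voxel size equals the field of view divided by the matrix size. The sketch below assumes that the functional FOV of 192 is expressed in millimetres, like the morphological FOV; the helper name is hypothetical.

```python
def in_plane_voxel_mm(fov_mm, matrix_size):
    """In-plane voxel edge length: field of view divided by matrix size."""
    return fov_mm / matrix_size


functional_voxel = in_plane_voxel_mm(192, 64)      # EPI volumes
morphological_voxel = in_plane_voxel_mm(256, 256)  # 3D-MPRAGE volume
functional_coverage_mm = 30 * 3                    # 30 slices, 3 mm, no gap

print(functional_voxel)        # 3.0
print(morphological_voxel)     # 1.0
print(functional_coverage_mm)  # 90
```

Both results agree with the voxel sizes quoted in the text (3 mm x 3 mm and 1 mm x 1 mm).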

DATA ANALYSIS 

Functional data were analyzed using MEDx software by Sensor Systems. First, all vol- 
umes in a study were realigned, in order to correct for physiological subject movement, 
with the software AIR included in the MEDx software package. Then, data were grouped 
according to the various sensory modalities. Three different groups were formed for the vi- 
sual modality so that comparison between the latter and other modalities was performed 
within the same group of subjects. All functional volumes were transformed into Talairach 
space and, within a single modality, or within a single group of subjects in the visual 
modality, were merged to form a larger block paradigm, consisting of 360 volumes. All 
volumes in such a group were normalized to the same baseline level. A spatial gaussian fil- 
ter 4 mm FWHM was applied. Voxel time courses were high pass filtered. The volumes in 
each modality group were divided into subgroups corresponding to volumes acquired during
the presentation of modality specific stimuli, and during the presentation of control
sentences respectively. Then a voxel-by-voxel Student t-test was performed, and the cor- 
responding Z-score maps were calculated and thresholded at Z=2.5 corresponding to a null 
probability p<0.006 (uncorrected). Subsequently the clustering algorithm of the MEDx 
package was run on these maps, thus selecting only the clusters of activation with a prob- 
ability larger than 0.5. Finally the clustered Z-score maps were superimposed on a high res- 
olution, Talairach transformed, morphological image. 
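
The relation between the Z threshold and the quoted probability can be illustrated with a minimal sketch. It assumes a one-tailed test, which the text does not state explicitly, and it infers five studies per modality group from the DESIGN section; it does not reproduce the MEDx processing itself.

```python
import math


def one_tailed_p(z):
    """Upper-tail probability of a standard normal variable exceeding z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


Z_THRESHOLD = 2.5
p_uncorrected = one_tailed_p(Z_THRESHOLD)
print(round(p_uncorrected, 4))  # 0.0062

# Size of a merged block paradigm, assuming five studies per modality group.
VOLUMES_PER_STUDY = 72      # six blocks of 12 volumes each
STUDIES_PER_GROUP = 5       # assumption inferred from the DESIGN section
merged_volumes = VOLUMES_PER_STUDY * STUDIES_PER_GROUP
print(merged_volumes)       # 360
print(merged_volumes // 2)  # 180 imagery volumes, 180 control volumes
```

Under the one-tailed assumption, Z=2.5 corresponds to a probability of about 0.006, in line with the value reported above.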

Lastly, in each map we looked for activation areas common to the visual modality on the one
hand, and the remaining modalities on the other. To this end we compared the
thresholded Z-score maps of a pair of modalities and selected the voxels that were sig- 
nificantly activated in both. This yielded maps of voxels activated in both modalities, 
which were then classified according to their neuroanatomical location by means of 
the Talairach atlas. 
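
The overlap step amounts to a voxel-wise logical AND of two thresholded maps. The following is a minimal sketch of that idea, not the MEDx procedure: the maps are represented as flat lists, the Z-scores are made-up illustrative values, and the function name is hypothetical.

```python
Z_THRESHOLD = 2.5


def classify_voxels(z_visual, z_other, threshold=Z_THRESHOLD):
    """Label each voxel by which thresholded map(s) it survives in."""
    labels = []
    for zv, zo in zip(z_visual, z_other):
        visual_on = zv > threshold
        other_on = zo > threshold
        if visual_on and other_on:
            labels.append("common")   # active in both modalities
        elif visual_on:
            labels.append("visual")   # active in visual imagery only
        elif other_on:
            labels.append("other")    # active in the compared modality only
        else:
            labels.append("none")
    return labels


# Made-up Z-scores for five voxels, for illustration only.
z_visual = [3.1, 2.7, 1.0, 0.4, 2.9]
z_other = [2.8, 1.2, 3.3, 0.9, 2.6]
labels = classify_voxels(z_visual, z_other)
print(labels)  # ['common', 'visual', 'other', 'none', 'common']
```

Only the voxels labelled "common" would enter the overlap maps that are then localized with the Talairach atlas.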

RESULTS

In all modalities a number of activated areas was clearly observed above the threshold. 
In Table 1 group data are listed. For each activation area the Talairach coordinates of the 
centroids of clusters of activation are indicated, together with the corresponding Brod- 
mann area, where applicable. 

In the visual modality the most prominent areas of activation are distributed bilateral- 
ly even though the right hemisphere, overall, appears to be more activated than the left 
one. However, activation in the temporal area was more intense on the left. Other promi- 
nent activation is observed in several orbital frontal areas, mainly on the right hemi- 
sphere. 

In the auditory modality the main areas of activation were, bilaterally, in the middle temporal
area, in the middle and superior pre-frontal area, and, unilaterally, in the left insula-pre-central
gyrus.

In the tactile modality the activation pattern is quite asymmetrical, with the most 
prominent activation in the left hemisphere. In the left hemisphere areas of activation are 
observed in the middle-inferior temporal, inferior frontal, inferior parietal areas. One 
symmetrical activation is observed in the insula, which is however much larger and more 
intense in the left hemisphere. Another symmetrical activation is in the post-central 
gyrus, here too, much more intense in the left hemisphere.

In the olfactory modality, bilateral areas of activation are observed in the middle frontal 
gyrus. In the left hemisphere, prominent activated areas are observed in the inferior-middle
temporal gyrus, in the parietal area and in the middle prefrontal gyrus. Overall, the
left hemisphere appears to be more activated than the right one. 

The gustatory modality shows a rough symmetry regarding the location of the active 
areas, but a strong asymmetry regarding their extension, the activation in the left hemi- 
sphere being much larger. Activated areas in both hemispheres are in the parietal region, 
in the post-central gyrus, in the insula, and in prefrontal areas. 

The organic modality shows a bilateral compound symmetrical activation pattern 
around the superior temporal area, with a maximum in the insula, in the pre-central op- 
erculum, and in the post-central gyrus, and a bilateral activation in the middle-superior 
frontal areas. Activation was also observed in the left parietal area, and in the left infe- 
rior temporal gyrus. 

The kinesthetic modality shows rather symmetrical activation in the cingulate gyrus and in
the middle and inferior temporal areas. In the left hemisphere, activation is observed in
the precuneus while in the right hemisphere activation is observed in the fusiform gyrus. 

The maps of voxels that were significantly activated both in the visual modality and in 
each of the other sensory modalities compared with it yielded the following results. 

Visual and auditory modalities share a bilateral activation in the middle-inferior temporal
area, although the maximum overlap is limited to a small portion only.

Modality    Hemisphere    Area

Fusiform/Hippocampal gyrus (BA37)
Middle-inferior temporal gyri (BA37)
Superior temporal gyrus (BA37)
Insula
Post-central gyrus (BA2)
Post-central gyrus (BA43)
Precuneus (BA7)
Inferior parietal lobule (BA40)
Middle-inferior frontal gyri (BA6)
Middle frontal gyrus (BA9/10)
Superior frontal gyri (BA9/10)
Middle-inferior frontal gyri (BA44/46/47)
Posterior cingulate gyrus (BA31)

Table 1. Talairach coordinates for the activated areas in the different modalities.

Figure 1. a) Visual and auditory modalities exhibit common activation in middle temporal areas. The upper right
panel shows a left sagittal view, while the lower right shows a right sagittal view. In this and the following
figures, colored voxels indicate Z>2.5 corresponding to a null probability p<0.006. Blue voxels represent visual
imagery activation, yellow voxels represent compared modality activation, and green voxels indicate common
areas of activation. Note that the left side of each image represents the right hemisphere and vice versa.

b) Visual and tactile modalities share common bilateral parietal areas (left panel): note the different degree of 
asymmetry in the extension of the activation. Other common activated areas in the parietal and middle tempo- 
ral regions are shown in a left sagittal view (right panel). 

c) Visual and olfactory modalities reveal common activated areas in the parietal and middle temporal region 
(left panel showing a left sagittal view), in the left middle temporal and middle frontal region; note the reversed 
asymmetry of the two patterns of activation (central panel), and in the left fusiform gyri (right panel). 

d) Visual and gustatory modalities show common activated areas in the inferior parietal region bilaterally (left 
and central panel) and in the right middle-inferior frontal areas (right panel). 



In visual and tactile modalities activated areas are seen in the parietal lobe, with a dif- 
ferent degree of asymmetry: in the left hemisphere activation is about equal, while in the 
right hemisphere the visual modality produces a more extended activation. Other com- 
mon areas are in the left middle-inferior temporal area. 

Visual and olfactory modalities show quite different activation patterns, with acti- 
vated areas mainly in the left hemisphere, namely in the middle-inferior temporal re- 
gion, and the parietal area. Middle frontal areas are activated bilaterally in both 
modalities, but with a reversed pattern of asymmetry. Indeed, in the left hemisphere 
the olfactory modality shows a more extended activation, while the reverse is true in 
the right hemisphere. 

In visual and gustatory modalities, an overlap of activation in the inferior parietal area 
is observed bilaterally but with a different degree of symmetry. In addition, overlaps are 
seen in the right middle-inferior frontal areas. 

No significant commonalities were found between visual and organic imagery and
between visual and kinesthetic imagery. However, both the organic and the kinesthetic
modalities show a consistent activation in the left middle-inferior temporal
area, even if it does not overlap with the corresponding area in the visual condition. 
In addition, activation in the parietal lobe was found for the organic modality but it 
does not share a common area with the corresponding activated region for the visu- 
al modality. 



DISCUSSION 

This study indicates that each type of mental imagery exhibits a different degree of
overlap with visual imagery as regards its cerebral correlates as revealed by
fMRI.

In general, visual imagery activates mainly the right hemisphere, while the tactile, ol- 
factory, and gustatory imageries elicit prominently left activation. Finally, auditory, 
kinesthetic, and organic imageries involve equally both hemispheres. 

The most consistent region of overlap is the middle-inferior temporal area, especially 
on the left hemisphere. In fact, auditory, tactile, and olfactory imageries all show com- 
mon activated areas in this region. In addition, organic and kinesthetic modalities also 
show activation in this region even though without any overlap. 

The parietal associative areas also exhibit a certain degree of consistency, because com- 
mon areas of activation with visual imagery were found for tactile, olfactory and gusta- 
tory modalities. Here again, the organic modality shows an activated area in this region 
but it does not overlap with the corresponding area in the visual condition. 

The prefrontal areas show a less consistent pattern of activation as they reveal common 
areas of activation only for the visual-gustatory comparison. However, activation in pre- 
frontal areas was also found in the visual-olfactory comparison, although the intensity 
pattern in the two hemispheres was reversed across modalities. 

In some cases, different sensory imageries activate the same area, in other cases the ar- 
eas of activation are close to each other in the same neuroanatomical area, indicating that 
the region is involved in both modalities but perhaps in a non-perfectly coincident way. 
Both cases could be explained, according to Calvert, Brammer & Iversen (1998), by the 
fact that the hetero-modal cortex either contains neurons responding to more than one 
modality or has closely interspersed populations of modality specific neurons, which are 
responsive to different modalities. 

Another tentative explanation could be derived from the proposal put forward by 
Singer (2000) regarding the coexistence in the mammalian brain of complementary 
strategies for the representation of mental contents: a strategy for items that occur very 
frequently and/or are of particular behavioral importance, and a second one reserved for 
items which are infrequent, novel, or of high complexity. Reviewing experimental data 
on vision, audition, motion and olfaction, Singer suggests that, in the latter case, items 
are coded by dynamically associated assemblies of feature-tuned cells formed by rapid 
and transient synchronization of the associated neurons. According to this hypothesis,
the partial mismatch between activated areas in visual imagery and activated areas in
other sensory modalities could be due to different degrees of either salience or com- 
plexity of the requested images. 

In this study, the inferior temporal region is often activated bilaterally, although the 
most consistent activation is found on the left side, especially in the auditory, tactile,
and olfactory modalities compared with the visual one. Regarding the visual modality,
while D’Esposito et al. (1997) reported left fMRI activation in this site, Mellet et al.
(1998 and 2000) reported a bilateral PET activation. They maintain that the right side 
of the inferior temporal area is responsible for the processing of complex shapes, while 
the left side seems to be engaged in the processing of verbalizable shapes. In this view 
the presence or the absence of right temporal cortex activation would depend on the 
complex or simple nature of the visual image processing, while the left activation 
would depend on being or not being verbalizable. In our study, all the images were 
cued by verbal items and the task requirements were held constant across conditions, 
therefore there is no reason to suppose complexity differences in the processing of dif- 
ferent images. In light of previous data, the left side of this area might have a role in 
connecting the verbal encoding of a word with its deeper representation (Wise et al., 
2000). However, the lack of any overlapping temporal activation for the gustatory 
modality and a non-perfectly coincident overlap for the organic and kinesthetic modalities
also suggest that this area may reflect the segregation of semantic knowledge into
anatomically discrete, but highly interactive, modality specific regions (Thompson-Schill
et al., 1999).

The activation of the parietal region in imagery processes is often reported in the literature,
frequently in association with tasks related to spatial processing (Iwaki et al.,
1999; Banati, Goerres, Tjoa, Aggleton & Grasby, 2000; Barnes et al., 2000; Diwad- 
kar, Carpenter & Just, 2000; Jordan et al., 2001). However, although in our task spa- 
tial processing was not requested, a consistent common activation, albeit not in- 
cluding all the modalities, was found in the parietal region. Jordan et al. (2001) sug- 
gest that this region may be responsible for the transformation of shapes into a 
supra-modal form, thus enabling the cognitive system to process visuo-spatial fea- 
tures in a way that is independent from sensory features. According to these authors, 
the network underlying this transformation may be involved in low-level attention- 
al processes, working for many types of cognitive processes (Coull & Frith, 1998; 
Coull & Nobre, 1998). In this view, the common area of activation we found in the 
parietal region may reflect supra-modal transformations mediated by low-level attentional
processes.

Regarding the prefrontal areas, some data support the idea that this region may be related
to the memory retrieval of mental images. Our data are consistent with previous data
indicating a hemispheric domain-specificity of the prefrontal cortex (right-sided for
spatial WM, bilateral or left-sided for verbal WM). In our study the visual modality 
shows a right-sided activation in the prefrontal areas, while other modalities show a
composite pattern distributed either bilaterally or on the left hemisphere, yielding only 
an overlap in the right middle prefrontal area between the visual and the gustatory
modalities. This pattern of activation may suggest that visual imagery (and perhaps also
the gustatory one) relies on spatial processes, while other modalities rely more upon
verbal processes. 

The reverse pattern of activation found in prefrontal areas for visual and olfactory
modalities is coherent with data reported by Zald & Pardo (2000), who describe a mainly left
prefrontal activation during hedonic judgments of odorants. However, these data refer to a
perceptual task and can supply only an indirect indication about olfactory imagery.

The lack of any consistent activation in primary sensory areas could be due to the kind 
of task used in this study. As suggested by Thompson, Kosslyn, Sukel & Alpert (2001), 
the primary visual cortex is activated more often when participants are requested to use 
the image in some way. In our study, in order to minimize differences among the condi- 
tions, apart from those related to the imagery modality, participants were simply re- 
quested to mentally represent the target item, i.e. they were requested to perform an im- 
age generation task. According to Behrmann (2000), image generation is a process more 
specific to imagery than image manipulation, because it involves the active reconstruc- 
tion of a long-term mental representation. Moreover, in our opinion, image manipulation 
involves some kind of on-line processing that might be more dependent on the specific 
content of the image to be manipulated. As our study was aimed at identifying the com- 
mon substrate of different imagery modalities, the image generation task seems to imply 
processes supposed to be less variable across modalities. 

An alternative explanation for the lack of activation of primary visual areas may lie
in the visual presentation of the items. However, studies that contrasted concrete items
vs. abstract items by using an auditory presentation (De Volder et al., 2001; Mellet et al.,
1998; D’Esposito et al., 1997) found substantially the same pattern of results for the vi- 
sual modality. 

Whether common areas indicate either the involvement of amodal functional circuits 
in mental imagery, or the presence of a visual imagery component also in different types 
of mental images, should be the object of further investigations. However, the first hypothesis
is a little more coherent with the results of the ratings of vividness of the material
used in this study, which show a clear-cut distinction among different types of images.

From this study, we can derive three key findings. First, common brain areas were 
found to be active in both visual imagery and imagery based on other sensory modali- 
ties. These common areas are supposed to reflect either the verbal retrieval of long-term 
representations or the segregation of long-term representations into highly interactive 
modality specific regions. 

Second, each imagery modality also activates distinct brain areas, suggesting that high-level
cognitive processes imply modality-specific operations. This result is coherent
with the domain-specific hypothesis proposed for the functioning of the fronto-parietal 
associative stream (Rushworth & Owen, 1998; Miller, 2000). 

Third, primary areas were never found to be active, suggesting that different, though 
interactive, neural circuits underlie low-level and high-level processes. This claim is
only indicative, as no direct comparisons were made in this study between imagery
and perceptual/motor processes; still, it highlights the lack of primary cortex activation
for imagery in those modalities that were not accompanied by any corresponding sensory
stimulation due to either the visual presentation of the stimuli or the noisy apparatus.
Further investigations will be essential to clarify this claim.

FORMS AND SCHEMES OF PERCEPTUAL
AND COGNITIVE SELF-ORGANISATION 




VICTOR ROSENTHAL 



MICROGENESIS, IMMEDIATE EXPERIENCE 
AND VISUAL PROCESSES IN READING 



INTRODUCTION 

The concept of microgenesis refers to the development on a brief present-time 
scale of a percept, a thought, an object of imagination, or an expression. It defines 
the occurrence of immediate experience as dynamic unfolding and differentiation in 
which the ‘germ’ of the final experience is already embodied in the early stages of 
its development. Immediate experience typically concerns the focal experience of 
an object that is thematized as a ‘figure’ in the global field of consciousness; this 
can involve a percept, thought, object of imagination, or expression (verbal and/or 
gestural). Yet, whatever its modality or content, focal experience is postulated to 
develop and stabilize through dynamic differentiation and unfolding. Such a mi- 
crogenetic description of immediate experience substantiates a phenomenological 
and genetic theory of cognition where any process of perception, thought, expression
or imagination is primarily a process of genetic differentiation and development,
rather than one of detection (of a stimulus array or information), transformation,
and integration (of multiple primitive components), as theories of the cognitivist
kind have contended.

The term microgenesis was first coined by Heinz Werner (1956) as a means of providing 
a genetic characterization of the structure and temporal dynamics of immediate experience, 
and, more generally, of any psychological process (Werner, 1957; Werner & Kaplan, 1956; 
Werner & Kaplan, 1963). But the genetic framework to which this term referred actually 
emerged in the mid-1920s in the context of Werner’s work at the University of Hamburg
and, to a certain extent, of the work of the Ganzheitspsychologie group in Leipzig led by 
Friedrich Sander. For Werner, microgenesis had not only a substantive (as a psychological 
theory) but also a methodological meaning. As a method, it either referred to genetic real- 
ization (Aktualgenese) which sought to provide the means of externalizing the course of 
brief perceptual or other cognitive processes by artificially eliciting ‘primitive’ (i.e.
developmentally early) responses that are normally occulted by the final experience (see in this
respect Sander, 1930; Werner, 1956). Or it referred to experimental psychogenesis which 
aimed to construct small-scale, living models of large-scale developmental processes in 
such a way as to ‘miniaturize’ (i.e. accelerate and/or telescope) the course of a given 
process and bring it under experimental control. Experimental psychogenesis, devised by 
Werner in the 1920s, played afterwards an important role in the work of Vygotsky and 
Luria who further extended its field of application and gave it a historical dimension 
(Catan, 1986; Vygotsky, 1978; Werner, 1957, first German edition published in 1926; 
Werner & Kaplan, 1956). As a theoretical framework, microgenesis constituted a rectification
of Gestalt theory especially in regard to its overly structural and agenetic character 1 .
Yet, together with the latter, microgenesis gave psychology its first cognitive paradigm. In
its modern version, microgenesis offers a genetic, phenomenological alternative to the
information-processing metaphor, an alternative that reunites mind and nature and restores
to cognition its cultural and hermeneutic dimensions. 

My purpose in this essay is to provide an overview of the main constructs of microge- 
netic theory, to outline its potential avenues of future development in the field of cogni- 
tive science, and to illustrate an application of the theory to research, using visual 
processes in reading as an example. In my overview, I shall not dwell on the history of 
microgenesis (the reader may find the relevant sources in Catan, 1986; Conrad, 1954; 
Sander, 1930; Werner, 1956; Werner, 1957; Werner & Kaplan, 1956) but rather describe 
its main constructs from a contemporary perspective.



MICROGENETIC DEVELOPMENT 

Microgenetic development concerns the psychogenetic dynamics of a process that can 
take from a few seconds (as in the case of perception and speech) up to several hours or 
even weeks (as in the case of reading, problem solving or skill acquisition). It is a living 
process that dynamically creates a structured coupling between a living being and its en- 
vironment and sustains a knowledge relationship between that being and its world of life 
(Umwelt). This knowledge relationship is protensively embodied in a readiness for further
action, and thereby has practical meaning and value. Microgenetic development is thus an 
essential form of cognitive process: it is a dynamic process that brings about readiness for 
action 2 . Microgenesis takes place in relation to a thematic field which, however unstable 
and poorly differentiated it might be, is always given from the outset. To this field, it brings 
stabilized, differentiated structure and thematic focalization, thereby conferring value and 
meaning to it. Figure/ground organizations are an illustration of a typical microgenetic de- 
velopment. Yet, one should bear in mind that however irresistible an organization might 
appear, it is never predetermined but admits of alternative solutions, that a ‘figure’ em- 
bodies a focal theme, and that a ‘ground’ is never phenomenologically or semantically 
empty. Thematic field denotes here a definite field of consciousness, and has both phe- 
nomenological and semantic meaning (see Gurwitsch, 1957). Focal thematic embodiment 
of microgenetic development thus differs from unfocussed, heterogeneous, and hete- 
rochronic ontogenesis, which spans a considerable portion of life and requires organic mat- 
uration and growth (see Werner, 1957, for a discussion of differences between microge- 
netic development and ontogenesis; Werner & Kaplan, 1956). 



MEANING AND FORM 

It should be noted that form, meaning and value are not considered separate or
independent entities. According to microgenetic theory, whatever acquires the
phenomenological status of individuated form acquires, ipso facto, value and meaning.
This may not necessarily be a focally attended meaning, as the definite experience of
meaning depends on whether a form is given a focal thematic status. Yet, regardless 
of the status of the meaning experience, microgenetic theory postulates that form has 
of necessity semantic and axiological extensions. Incidentally, this point highlights 
the radical opposition between microgenesis and the standard cognitivist stance, 
where form, meaning and value are deemed independent and mobilize processes that 
are intrinsically alien to one another. If meaning and value are acknowledged to af- 
fect perception, as the seminal experiment of Bruner and Goodman (1947) revealed 
by showing that the size of a coin is seen as bigger when it is highly valued, it is as- 
sumed that this influence obtains via the interaction of processes. Yet no precise ex- 
planation has been supplied as to how structurally different processes, which deal 
with incommensurable factors, can ever interact with one another 3 . Werner and
Wapner (1952) observed many years ago that theories which separate sensory, semantic,
motivational and emotional processes, and view perception as a construction of ab- 
stract forms out of meaningless features (only to discover later their identity and 
meaning), face in this respect insurmountable paradoxes. If semantics postdates mor- 
phology, then it cannot affect form reconstruction, and if semantics is concomitant 
with form reconstruction, how can it influence morphological processing prior to 
‘knowing’ what the latter is about? Finally, since morphological and semantic 
processes are viewed as incommensurable, how can they be brought to cooperate to- 
gether without recourse to yet another, higher-order process? Invoking such a process 
would either amount to conjuring up a sentient device of the homunculus variety or 
would stand in contradiction to the very postulate of the distinctness and independ- 
ence of meaning and form. 



CATEGORIZATION 

According to the present account, no such interaction is to be sought because meaning 
and form are not separate or independent entities; on the contrary, perception is directly 
meaning and value-laden, with actual meaning developing along the global-to-local (in- 
definite/general-to-definite/specific) dynamics of microgenesis. The gradual differentia- 
tion of a meaning, percept or concept involves a global-to-local course of development, 
where meaning and value go hand-in-hand with perceptual or cognitive organization, de- 
veloping from vague and general to definite and specific. Note, though, that no direct ho- 
listic principle can be viable if it does not rely on a process of categorization. Immedi- 
ate categorization represents another essential feature of microgenetic development: it 
provides the dynamic link between holistic differentiation, meaning and readiness for ac- 
tion. Consider, indeed, that even the most basic categorization has meaning - meaning 
is thus not the end product of perception but rather part and parcel of the perceptual 
process 4 . The psychological literature gives ample evidence of the fact that subjects
carry out basic categorization instantaneously (e.g., discrimination of relevant from
irrelevant stimuli), without first making a more complete identification of the stimuli, and that
preliminary categorization improves the rate of definite identification (Brand, 1971;
Ingling, 1972). It should also be emphasized that categorization necessarily delineates a
horizon of action, a horizon that comprises a range of relevant acts that the subject may 
potentially be compelled to enact. 

Perception can then be said to act under the assumption of the consistency and mean- 
ingfulness of the world in which we live: the perceptual system ‘assumes’ that whatev- 
er it encounters has structure and meaning. It therefore anticipates and actively seeks 
meaningful structures (objects) and immediately categorizes them on a global dynamic 
basis 5 . Microgenetic theory contains here a hermeneutic principle: in order to be mean- 
ingful, perception must consist in dynamic categorization evolving from general to spe- 
cific, from vague and global to precise and local (see Rastier, 1997). Incidentally, this 
explains why the ‘germ’ of the final percept is already embodied in early stages of the 
perceptual process. Immediate categorization allows for the categorical continuity of 
forms throughout the entire perceptual process giving cohesiveness and stability to the 
perceived world (see Cadiot & Visetti, 2001; Gurwitsch, 1966, chap. 1). This primary 
categorization may be insufficient for the precise and overt perceptual identification of 
objects - as required by standard psychological experiments - and may then need to be 
completed by a process of local discrimination. This complementary discrimination, 
necessary for the focal thematization of a ‘figure’, is greatly constrained by the former 
process; it operates within a restricted categorical domain, and can thus bear selectively 
on the properties of the percept that are distinctive. From a phenomenological viewpoint, 
discrimination is what brings about the overt identification of a percept. 



BREAKING UP THE HOLISTIC FABRIC OF REALITY 

The segmentation of the perceptual field into individual objects is thus the result of per- 
ceptual differentiation, and not the objective state of affairs that perception would mere- 
ly seek to detect and acknowledge. In this sense, microgenesis is the process that breaks 
up the holistic fabric of reality into variably differentiated yet meaningful objects, beings 
and relations. From Aristotle to Poincaré and Thom, scores of philosophers and
mathematicians have speculated about the ontological precedence of continuum over discrete
structures, and suggested that individuated forms are created by breaking up the contin- 
uous fabric of reality, and not the other way around. From the microgenetic viewpoint, 
we may invoke genetic 6 precedence of continuum over discrete structures, where cate- 
gorization and dynamic thematization act as organizing principles in breaking up the 
continuous fabric of reality into individuated forms. 

The idea of the genetic precedence of holistic fabric over individuated forms in the 
course of perceptual differentiation runs counter to standard cognitivist theories where 
form perception is basically viewed as a reconstruction from components (or elementary 
features), followed by the projection of the reconstructed object onto the internal screen 
of the mind (i.e. representation). It bears noting that the idea of a projection onto a
mental screen is phenomenologically vacuous (i.e. provides no explanation of perceptual
experience) and smacks of a homunculus (it takes a homunculus to contemplate the screen).
Yet, since phenomenological issues seldom preoccupy the proponents of cognitivism, and 
they bluntly dismiss any alternative perspective - confident as they are that their theoret- 
ical stance will ultimately be corroborated by neurophysiological and psychological evi- 
dence - we may provisionally embrace their concerns and examine the issue of elemen- 
tary features. Because any real form can be decomposed into a countless number of fea- 
tures, and what makes a useful feature in one case may have little utility in another, one 
may wonder how the perceptual system is able to pick in advance the useful features of 
an as yet unreconstructed form. A way out of this problem might be to suggest the exis- 
tence of a finite set of generic features that could be made use of in the (re)construction 
of any possible form. But then there would be tremendous differences with respect to the 
ease with which various forms are reconstructed, and the task may even turn out to be im- 
practicable in the absence of an organizing principle, which, again, would have to be 
known in advance. Clearly, the proposition that perception is based on reconstruction 
from elementary components raises more problems than it may be expected to solve. 



PRESENT-TIME EXPERIENCE 

Optics, acoustics, chemistry, topology, as well as technological metaphors of photog- 
raphy, motion pictures, television or recording devices have, during the past century, 
greatly inspired scientific theorizing on perception. In their physicalistic fervor, genera- 
tions of psychologists and neuroscientists alike somehow lost sight of the very phenom- 
enological character of reality, let alone the necessity of explaining why present-time ex- 
perience has continuity and depth. Why is it that what occurs in present-time is not infi- 
nitely brief, that experience does not consist of a kaleidoscopic succession of discon- 
nected instants but has consistency and duration? Bergson, Husserl, and Merleau-Ponty, 
to mention the most outstanding authors, have penetratingly described and analyzed the 
issue of a non-evanescent present, of which my own description would be a pale rendi- 
tion. Let me underscore, nevertheless, that perception critically involves an enduring and 
consistent presence in experience. This presence signifies that there is a continuous 
structure to experience, or more properly, a continuous forward-oriented dynamics, so 
that the present-time is neither infinitely brief nor evanescent, but has depth (or thick- 
ness) and consistency stretching dynamically from its immediate predecessor to its an- 
ticipated successor. To use Husserl’s terminology, the now has retentions and proten- 
tions. Perception theorists who keep on brushing aside this continuous forward-oriented 
dynamics of present-time experience can be likened to conscientious parents who throw 
their baby out with the bathwater. Even if one were to regard the perceptual field as a 
kind of external memory - to quote a recent theory (O’Regan, 1992) - where any part of
the field is kept available for further inspection - this very availability critically depends 
on the continuous dynamics of a forward-oriented stretch of present-time. Were this not 
so, the issue of availability for further inspection would be pointless as, at each and every 
instant, the perceptual process would have to start anew. Whatever is present in
experience is so by virtue of a process that dynamically extends in time. This presence in
experience is by no means illusory, if by illusory one implies something unreal, because
the only reality available to us is the one we experience. 



DUAL DYNAMICS OF TIME: UNFOLDING AND DEPLOYMENT 

It is my suggestion that microgenetic theory provides an adequate framework for the 
explanation of this dynamic process. To show how, I shall first introduce the concept of 
autochrony which refers to the internal, unidirectional (i.e. forward-oriented), self-gen- 
erated time characteristic of the living process (Rosenthal, 1993). This self-generated, in- 
ternal time, which determines the flow of the living process and is proper to each living 
species, has a phenomenological and biological meaning. It confers on temporal dy- 
namics its intrinsic direction as well as the periodicity specific to each species 7 . It is at 
the very heart of the autonomy of action and provides the latter with its driving impulse. 
Note, indeed, that in order to be autonomous, rather than merely reactive, an action has 
to be self-generated. Yet, if we are to account for the continuous forward-oriented dy- 
namics of present-time experience, what is further required is the idea of a dual dynam- 
ics of microgenetic development, one of unfolding and one of deployment. Experience 
has consistency and duration because it has a developmental history, a history that di- 
achronically deploys and unfolds. Unfolding refers to the developmental succession of 
intermediate phases of ongoing experience, whereas deployment designates the fact that 
a figure has temporal extension, the time it takes to deploy in experience. This dual dy- 
namics of microgenetic development, whereby experience gradually unfolds through 
differentiation and the deployment of intermediate figures, and where successive de- 
ployments tend to occult their predecessors but not the very sense of developmental his- 
tory, confers on present-time experience its temporal depth and consistency. There is 
thus depth and consistency in the present-time because we sense the developmental his- 
tory of ongoing experience without being able, at least usually 9 , to evoke its intermedi- 
ate deployments, as they are occulted by the current occurrence of the present. 

The cohesiveness of gradually unfolding present-time experience depends also on the 
anticipatory and categorial character of microgenetic development. Categorization al- 
lows for the continuity of form identity throughout perceptual development, giving it co- 
hesiveness and stability. Anticipation, which should not be mistaken for the effective ex- 
pectation of definite objects or states of affairs, designates a protensive readiness for ac- 
tion: we actively anticipate and seek meaningful structures and immediately categorize 
them in view of prospective action. 



GRADUAL CHARACTER OF IMMEDIATE EXPERIENCE 



The hidden, gradual character of immediate experience attracted considerable attention 
on the part of the founders of microgenetic theory. The method of genetic realization
(Aktualgenese) was actually developed by Sander in order to externalize the course of
microgenetic development by artificially eliciting ‘primitive’ (i.e. developmentally ear- 
ly) responses which are normally occulted by the final experience (Sander, 1930; Wern- 
er, 1956). In the field of visual perception, this was accomplished by repeatedly pre- 
senting very brief, poorly lit or miniature stimuli, and gradually increasing exposure 
time, improving lighting or letting the stimulus grow to ‘normal’ size. The subjects or, 
more appropriately, the observers were invited to describe what they perceived and felt 
as the experiment unfolded. Sander provided minute descriptions of these ‘primitive’ re- 
sponses, observing that “the emergent perceptual constructs are by no means mere im- 
perfect or vague versions of the final figure (...) but characteristic metamorphoses with 
qualitative individuality, ‘preformulations’ (Vorgestalten)” (ibid., p. 193). He noted that
in the course of an unfolding perception, the development does not amount to a steady, 
progressive improvement whereby each successive deployment is a more elaborate ver- 
sion of its predecessor that comes closer to the final percept. Rather, the development 
observed in Aktualgenese exhibits the characteristic structural dynamics at work in per- 
ception. “The formation of the successive stages, which usually emanate one from the 
other by sudden jerks, has a certain shading of non-finality; the intermediaries lack the 
relative stability and composure of the final forms; they are restless, agitated, and full of 
tensions, as though in a plastic state of becoming.” Moreover, “this structural dynamics,
which (...) [is] one of the determining factors in the process of perception itself, enters
our immediate experience in the form of certain dynamic qualities of the total ‘state of
mind’, in emotive qualitative tonalities” (p. 194).

The structural dynamics at work in an unfolding perception generates intense
emotional involvement on the part of the experiencer. The perceptual development,
artificially externalized by the method of Aktualgenese, is not something the observer follows
with cool objectivity and detachment, but “all metamorphoses are engulfed in a[n]... 
emotional process of pronouncedly impulsive and tensor nature, and take place through 
an intense participation of the whole human organism” (p. 194). There is an ‘inner urge’ 
for ‘formation of the ill formed’ and for meaningfulness. The intermediate deployments 
are thus experienced with a ‘peculiar feeling-tone’ correlated with the instability and 
non-finality of a given occurrence and are animated by the dynamics of what Sander’s 
gestaltist counterparts called Prägnanz (the ‘urge’ for symmetry, regularity,
homogeneity, simplicity, stability...). The emotional involvement observed in genetic realization,
which could be viewed as excessive in regard to an unremarkable object of actual per- 
ception, can nevertheless be experienced under ‘normal’ conditions. A picture hanging 
crooked on the wall can become unbearable and can literally shriek to be set straight. 

Werner gave a markedly similar description of these structural dynamics at work in mi- 
crogenetic development and of the intermediate deployments that are occulted by the fi- 
nal experience, placing an emphasis on total bodily feeling, emotional-kinesthetic dy- 
namics and action-like inner gestures. But he was more concerned with the semantic as- 
pects of microgenesis and specified the characteristics of meaning stabilization and dif- 
ferentiation (Werner, 1930; Werner, 1956; Werner & Kaplan, 1956; Werner & Kaplan, 
1963). In particular, he noted an early emergence of the general sphere of meaning, and
described the developmental dynamics of meaning structure as characterized by
sphere-like deployments, where gradual differentiation is not necessarily accomplished by
contracting the semantic sphere but also involves shifts of the ‘center of gravity’. Thus a
target ‘cigar’ may at one point elicit the “primitive” response ‘smoke’, at another, ‘cancer’.
Some of the most interesting observations he made stemmed from neuropsychology 
where pathological behavior due to brain damage was described as an arrest of the
microgenetic process at an earlier stage of development so that the patient’s responses took the
form of unfinished ‘products’ which would normally undergo further development (see 
Conrad, 1954; Semenza, Bisiacchi, & Rosenthal, 1988; Werner, 1956). 



THEMATIC ORGANIZATION OF CONSCIOUS EXPERIENCE 

The foregoing descriptions of microgenetic experiments shed light on the non-uni- 
tary and gradual character of conscious experience. For one thing, the thematic organ- 
ization of conscious experience does not amount to mere contrastive juxtaposition 
where the theme (focal figure) is granted focal awareness and the ground is phenom- 
enologically empty. Background objects are not speechless; they hang together with 
the theme as a sort of supportive frame, yet each brings in its intentional horizon and 
thereby constitutes a potential landmark for alternative thematic organizations. More- 
over, thematic organization is not inherent to the field and is largely dependent on the 
subject’s engagement in action; accordingly, access to phenomenal sensations depends 
on this engagement in action. Thus, for instance, physically the same perceptual con- 
text can give rise to different reports depending on the type of action in which the sub- 
ject is involved (see e.g. Marcel, 1993). Finally, although the history of a microgenet- 
ic development is usually obscured by the final deployment, both the Aktualgenese ex- 
periments and the elusive fading impressions of intermediate deployments we some- 
times have (and which are not necessarily imperfect versions of the final figure) sug- 
gest that conscious experience develops gradually and that the organization of the the- 
matic field undergoes successive adjustments. These dynamic characteristics of con- 
scious experience bear witness to the importance of the concept of structural instabil- 
ity for the theory of immediate experience. 



PHYSIOGNOMIC CHARACTER OF PERCEPTION 

The overall dynamic structure of microgenetic development may also account for the 
physiognomic character of perceptual experience. Physiognomic means here that we per- 
ceive objects as “directly expressing an inner form of life” (Werner, 1957, p. 69), that is, 
in the same manner in which we experience physiognomies, facial expressions, gestures, 
or, more generally, acts of living beings. Following this line, perceived forms are not 
static morphological configurations but dynamic deployments, where the overall
‘dynamic tone’ is part and parcel of the experienced percept 10 . Accordingly, all perceived
objects, whatever their nature, partake of physiognomic qualities. The physiognomic
character of perception has been extensively discussed by Werner and also by Köhler (1938;
1947) and Arnheim (1954; 1969), but the idea of physiognomic perception also
prompted a series of misunderstandings, most notably on the part of Gibson (1979) and Fodor
(1964). It should be observed that two aspects of physiognomic perception must be
taken into account. The first concerns the expressive character of percepts, and the second
the conative dimension of perception whereby the readiness for action imbedded in
perceptual experience ‘urges’ us to act upon, or use, perceived objects (see also the concept
of gerundival perception in Lambie & Marcel, 2002). Gibson’s concept of affordance,
an anglicized version of Kurt Lewin’s Aufforderungscharakter (invitation character), is 
partly grounded on this latter idea, though it doesn’t convey the sense of an urge to act 
but merely invokes an invitation. This urge to action is most readily observed in the be- 
havior of children, in so-called ‘primitive peoples’, and under the influence of certain 
drugs (Werner, 1957). Brain pathology gives an amazing example of the conative char- 
acter of perception in so-called utilization behavior where the patient cannot but use 
whatever object he or she happens to come across (Lhermitte, 1983; Shallice, Burgess, 
Schon, & Baxter, 1989). As Lhermitte observed, for the patient, the perception of an ob- 
ject implies the order to grasp and use the object. As the above example suggests, this 
pathological behavior is by no means an aberrant creation of pathology, but an expres- 
sion of the readiness for action imbedded in perceptual dynamics. In the social context 
of Western Societies, this readiness for action does not necessarily prompt effective ob- 
ject manipulation, at least in adult behavior, but in the context of certain brain lesions en- 
actment may become irresistible. 

The expressive character of perception is obviously no less imbedded in perceptual
dynamics. As Köhler and Arnheim cogently argued, the expressivity of the perceived world
is directly experienced by the perceivers and does not result from empathic projection or 
from perceived analogy with their own past expressions and feelings. For one thing, we 
cannot simultaneously be external observers and the experiencers of our own interiority 
and exteriority. How could we then acquire the dual knowledge that would serve as the
basis for an analogy? Second, the analogy could only hold between comparable entities 
or configurations; yet when we perceive a sad tree, a cheerful landscape, or the lovely 
face of Dorothee, this can hardly be due to the knowledge of our own expressions of
sadness and cheerfulness, or, for the present writer, of his own loveliness. Clearly, an
indirect principle, whereby perceived morphologies or dynamic configurations (e.g. facial
expressions, gestures) are subsequently interpreted by analogy or empathic projection, 
can hardly count as a satisfactory explanation of the expressivity of the perceived world. 
On the contrary, expressivity constitutes a forceful illustration of the dynamic principle 
at work in perceptual development whereby even the morphology of static forms is 
grounded in the configural dynamics of the deployment of the percept 11 .

The acknowledgement of the physiognomic character of perception shouldn’t be naive- 
ly interpreted to suggest that our everyday perception is overflowing with an expressive 
world where objects and landscapes are animated by inner life. As adult members of 
Western Societies we certainly do not find ourselves overwhelmed by the expressivity
of the perceived world - we generally barely pay attention to it - and many people will
even be reluctant to admit that their perceptual experience has an expressive flavor. The 
expressivity of the perceived world is clearly at odds with the matter-of-fact style of our 
social world. In a labor-oriented society - where activity is governed by non-immediate 
goals and where, as Benny Shanon noted 12 , voluntary ignorance of a good deal of what 
we are otherwise able to perceive but what falls outside the tacitly agreed upon terms of 
relationships, is one of the founding components of interpersonal relations -
physiognomic impressions normally recede to the background and form, at best, an elusive
feeling tone. The perception of inanimate objects is no less affected by the style of our
social world, for we belong to a social world even when alone. Yet, reports of
physiognomic perception abound in child psychology, ethnopsychology and clinical psychology.
Children, so-called ‘primitive peoples’ and, for instance, certain schizophrenics manifest 
in their behavior clearly identifiable reactions to the perceived expressive character of 
objects. Moreover, the ease and the naturalness with which we are receptive to expres- 
sivity in literature, painting or music would remain inexplicable were we not to assume 
that this receptiveness builds upon a disposition that was ‘already there’. Clearly, these 
observations testify to expressivity in perception. The acknowledgement of the physiog- 
nomic character of perception brings us closer to a scientific explanation of the origin of 
esthetic and ethical attitudes. Although for many students of cognition this issue is sec- 
ondary or falls beyond the scope of a scientific endeavor, I submit that the inability of 
cognitivist theories to account for the origin of esthetic and ethical attitudes, their failure 
to even perceive the fundamental status of esthetics and ethics in regard to human cog- 
nition, constitute some of their major shortcomings. It is certainly not irrelevant that 
Gaetano Kanizsa, whose work inspired this collection of essays and whose phenomeno- 
logical orientation resolutely opposed cognitivist approaches to perception, was deeply 
concerned with perception’s esthetic character as well as being an accomplished painter. 



GENETIC PHENOMENOLOGICAL SCIENCE OF COGNITION 

The basic constructs of microgenetic theory outlined so far may be viewed as land- 
marks for a genetic phenomenological science of cognition. A reader familiar with 
Gestalt theory will easily recognize in this overview the legacy of Wolfgang Köhler, in 
particular his idea of stabilization in dynamic systems and his concepts of value and expressivity 
in perception. These ideas are, however, reformulated to take into account 
temporal dynamics so as to be able to define cognitive process in terms of a dynamic de- 
velopment characterized by gradual differentiation and deployments, variable stabiliza- 
tion as well as unfolding and thematic focalization. This micro-development has an an- 
ticipatory and categorial character, to which, strangely enough, the gestaltists paid little 
attention. 

It should be stressed that the phenomenological character of microgenetic theory does 
not prevent it from being amenable to evaluation by the methods of natural science. 
Much as the original theory of Werner was built on ample experimental evidence, various 
specific postulates of the present theory may be subjected to experimental evaluation 
in psychology and neuroscience where, incidentally, a variety of tools to probe brain 
dynamics have recently been devised. Note that experimental evaluation does not re- 
quire that phenomenology be naturalized. The very idea of an experimental phenome- 
nology is precisely to bring a naive openness on the part of the subject, uncontaminated 
by formal knowledge, to the experimental situation, with its precise physical measures 
and control of experimental variables. Similarly, a comparable naive openness is re- 
quired on the part of the scientist whose genuine questioning free of conceptual preju- 
dice is the only way to ‘get in touch’ with the original reality he seeks to describe, and, 
as such, is the necessary counterpart of his otherwise naturalistic stance (see Bozzi, 
1989; Rosenthal & Visetti, 1999; Vicario, 1993). In the next section, I shall briefly re- 
view neurophysiological, neuroanatomical, and eye-movement data and suggest that the 
bulk of available evidence is basically consistent with the notion of a global-to-local de- 
velopmental dynamics in visual perception and the idea of the gradual differentiation of 
the percept. However, before I turn to these data, I should like to point out that some of 
the most promising developments for microgenetic theory in the field of cognitive sci- 
ence might be sought in the use of the modern mathematical and physical concepts of in- 
stability, and in the application of the theory of complex systems in modeling the dy- 
namics of microgenetic differentiation (see also Visetti, this volume). 



EARLY STRUCTURE IN VISION 

There is a predilection among many vision scientists for the traditional atomistic ex- 
planation of visual perception according to which the putative percept is reconstructed 
at the level of the higher cortical structures from an unstructured mosaic of elementary sensations 
that are produced on the retina and dispatched via retinofugal pathways to these 
cortical structures 13 . I shall argue, to the contrary, that the anatomical and physiological 
studies of the retinofugal pathways in primates support the proposition that considerable 
structure emerges already at the lowest levels of visual processes, and that these studies 
lend credence to the idea of holistic precedence, as well as, indirectly, to the overall 
schema of global-to-local structure of visual processes involving early categorization. 

Retinal projections to the cerebral cortex are dominated by two major pathways, the 
magnocellular (M) and parvocellular (P) systems, which are relayed by the magnocellu- 
lar and parvocellular subdivisions of the lateral geniculate nucleus (see Merigan & 
Maunsell, 1993; Shapley & Perry, 1986). The M ganglion cells have large soma, with 
extensive dendritic trees and large axons, whereas the P ganglion cells have smaller so- 
ma, small dendritic arbors and medium-size axons (see Leventhal, Rodieck, & Dreher, 
1981). It is important to note that the conduction velocity of M cells is greater than that 
of P cells due to the larger axonal diameter of M cells. Moreover, the M cells have large 
receptive fields, rapid temporal dynamics, and are more sensitive to low spatial fre- 
quencies. This system is sensitive to the coarse spatial distribution essential to the dif- 
ferentiation of basic form and for figure-ground segregation. The P system, which has 
smaller receptive fields and is more sensitive to higher spatial frequencies, samples the 
retinal image with higher resolution that is relevant to local spatial detail and color. 
There seems to be a division of labor between the two systems such that the M system 
quickly processes coarse form and the P system subsequently specializes in fine detail 
and color. The two systems thus sense different but overlapping portions of visible spatial 
and temporal frequencies (Livingstone & Hubel, 1988). In a word, retinal ‘sensations’ 
are processed twice in a nonredundant fashion and each time using ‘data’ in a different 
format. Arguably, the M system provides a quick primal glimpse of the visual 
field, supplying sufficient structural information about gross spatial discontinuities and 
their position in the field to guide the processing of the P system. This enables the ocu- 
lomotor system 14 to adjust gaze position and generates dynamic displacement thereby 
creating spatio-temporal discontinuities to which the M system is also sensitive (see 
Lehmkuhle, 1993; Merigan & Maunsell, 1993). Two important points emerge from this 
description. (1) Since the magnocellular system has temporal precedence, the earliest 
processes in vision are necessarily global and coarse-grained. (2) As the M ganglion cells 
have large receptive fields and are sensitive to coarse spatial distribution, displacement 
and temporal dynamics, there are reasons to believe that the M system segments the vi- 
sual field on the basis of gross spatial and temporal discontinuities and of their joint dis- 
placement 15 . These observations strongly favor the proposition that the ‘stimulus’ 
brought by the magnocellular projection in V1 (striate cortex) already has considerable 
structure. 

The notions of a global-to-local developmental dynamics in visual perception and of an 
early categorization of visual forms will, however, best be evaluated by combining the 
foregoing anatomical and physiological considerations with evidence from eye move- 
ment studies. It should be noted that the magnocellular system is mostly involved in ex- 
trafoveal (both parafoveal and peripheral) vision whose definition is insufficient for local 
detail, that it presumably exerts control on eye movements, and that foveal fixations (nec- 
essary for the exploration of local detail) are highly selective and cover only a small part 
of the visual field, mainly the figure (see O’Regan & Noë, 2002; Underwood, 1998; 
Yarbus, 1967). This selectivity is obviously inconsistent with the ‘mosaic theory’, at least 
as far as the whole visual field is concerned, for how can the visual system reconstruct the 
whole field when over 80% of its ‘elementary components’ are unavailable? But selectivity 
is also interesting for other reasons. In order to act selectively a system has to have prior 
‘knowledge’ on which to base the selection. In this case, the system has to spot the figure 
first and, then, adjust the gaze so as to fixate parts of this figure. Now, several characteristics 
of what makes up a figure need to be recalled: (a) it is a form, (b) it is necessarily 
meaningful, and (c) it has thematic prominence with respect to the rest of the field. 
But how can this come about were we first to construct abstract forms out of meaningless 
features, only to discover later their identity and meaning? Obviously, such a form could 
not first be (re)constructed out of the mosaic of its local meaningless components, and 
then targeted for central fixation, because local components can only be explored when 
fixated in central vision. Incidentally, this observation lends further support to the above 
proposal that basic form emerges in early coarse vision. But since the form in question is 
also a figure standing in the field, it comes accompanied by its semantic and thematic extensions. 
One can hardly explain how a figure can be spotted in early coarse vision if visual 
perception did not have a directly categorial and anticipatory character. Moreover, 
since early vision can at best provide a raw sketch of the figure, which may then be fur- 
ther explored in central vision, there are reasons to assume that the postulated schema of 
global-to-local, coarse-grained-to-fine-grained differentiation in visual perception rests on 
firm grounds. 



SUBJECTIVE FIGURES 

Various phenomena of perceptual completion, whether figures, surfaces or regions, 
provide an interesting illustration of microgenetic dynamics at work in perception. Con- 
sider the famous example of the Kanizsa square where a collinear arrangement of edges 
of four white ‘pacmen’ (inducers) on a black background gives rise to the perception of 
a black square whose area appears slightly darker than the background. In addition, the 
surface of the square appears to the observer to be in front of four disks that it partly oc- 
cludes. Since the square is perceived in spite of the absence of corresponding luminance 
changes (i.e. forming complete boundaries), and thus does not reflect any real distal ob- 
ject, it can only be created by the visual system which purportedly completes, closes, and 
fills in the surfaces between ‘fragments’, so as to make the resulting ‘subjective’ region 
emerge as a figure standing in the ground. Yet, as Kanizsa (1976; 1979) aptly showed, this 
and other examples of so-called subjective contours demonstrate the basic validity of 
Gestalt principles of field organization, in particular of its figure/ground structure and of 
Prägnanz, whereby incomplete fragments are, upon completion, transformed into simpler, 
stable and regular figures. Although this phenomenon is often described in terms of 
contour completion, it clearly demonstrates a figural effect, whereby the visual system 
imposes a figural organization of the field (and hence figure completion), and where the 
contour results from perceiving a surface, not the other way around, again as Kanizsa 
suggested. Moreover, these subjective figures illustrate the categorial and anticipatory 
character of microgenetic development, such that the perceptual system anticipates and 
actively seeks meaningful structures and immediately categorizes them on a global dynamic 
basis 16 . The crucial role of meaningfulness is demonstrated by the fact that no subjective 
figures arise in perception when the spatial arrangement of inducers does not approximate 
a ‘sensible form’ or when the inducers are themselves meaningful (viz. complete) 
forms 17 . 

What makes these subjective figures even more valuable for the present discussion is 
that they may be viewed as an instantiation of early structure and of holistic precedence 
in visual development. In recent years, there has been considerable debate in vision sci- 
ence concerning the neural mechanism underlying perceptual filling-in and several re- 
searchers have claimed to have identified subpopulations of cortical cells specialized in 
various aspects of perceptual completion (see e.g. Lesher, 1995; and Pessoa, Thompson, 
& Noë, 1998, for a review and critical discussion). One problem with these postulates of 
low-level filling-in neural mechanisms is the frequent confusion between what pertains 
to receptive field dynamics related to the blind spot or scotomata and what pertains to the 
perception of genuine subjective figures. Another problem is that finding neurons whose 
behavior correlates with the perception of subjective figures does not imply that these 
neurons are actually responsible or even used for perceptual completion. The next prob- 
lem is that no subpopulation of specialized cells can account for the fact that subjective 
figures are always sensible meaningful forms. Finally, many studies have tended to over- 
stress the importance of contour (which as Kanizsa showed is secondary to surface per- 
ception) and thus assumed a critical role for collinear alignment of edge inducers when 
actually such alignments are not a necessary condition (as the Sambin/Kanizsa cross ex- 
amples demonstrate, see Figure 1 below, and Kanizsa, 1976). It is important to note in 
this respect that there is presently a considerable body of neurophysiological and neu- 
ropsychological evidence supporting the idea that surface formation and completion, in- 
volving context-dependent figure/ground segregation, occurs very early in the course of 
vision and on a global basis (Davis & Driver, 1994; Lamme, 1995; Mattingley, Davis, & 
Driver, 1997). This evidence confirms Kanizsa’s results and further corroborates the microgenetic 
postulates of the dynamic, directly categorial (viz. meaning-laden) and anticipatory 
character of field organization. Although many scientists among the neuroscience 
intelligentsia continue to favor a modern version of the Helmholtzian doctrine 
according to which the percept (here the subjective figure) is reconstructed at the level 
of ‘sentient’ higher cortical structures from an unstructured mosaic of elementary sensa- 
tions processed by specialized local detectors, I submit that the above examples and dis- 
cussion provide powerful arguments in support of the microgenetic theory of perceptu- 
al development outlined in this essay. 



THE MICROGENESIS OF VISUAL PROCESSES IN READING 

I shall turn now to a specific illustration of certain principles of microgenetic theory 
in the field of reading. I have chosen reading because it is a peculiar skill. It takes both 
language and perception to become a reader, yet language and perception won’t suffice; 
some people never become proficient readers, and a brain lesion can disrupt reading 
skills in subjects otherwise showing no defect in object perception and spoken language. 
The persistence of oral civilizations and of nonliterate societies further teaches 
us that not just any form of social world is appropriate for the advent of literacy. Moreover, 
the passage from nonliterate to literate society deeply alters syntax, vocabulary and 
language use, as well as the mnemonic and cognitive practices of society members, 
and, ultimately, the society itself. At the same time, reading is interesting for our pur- 
pose as it handily lends itself to the evaluation of the postulates of immediate catego- 
rization and meaningfulness, selectivity, and the global-to-local structure of perceptu- 
al development. 

Figure 1. Sambin-Kanizsa cross examples. Note that the subjective surface in the center of the thin cross (left) 
has the form of a circle whereas the subjective surface in the center of the thick cross has the form of a square. 

Language and perception are unsettling accomplices of literacy. Their relationship involves 
a kind of co-determinism where it is difficult to regard written language as a simple 
externalization of information, lying ‘out there’ and waiting to be identified and reinternalized. 
Wertheimer once observed that for a proficient literate individual it is neither 
necessary to identify each individual letter nor to overtly recognize every word while 
reading a text. In 1913, he and Pötzl reported on an alexic patient who had “lost the ability 
to perceive words as gestalts”. In spite of his preserved capacity to recognize individual 
letters, the patient was practically unable to read words; his preserved ability to 
identify component letters (elementary segments) along with the inability to recognize 
words (functional wholes) - which, incidentally, proved to be unresponsive to training - 
constituted in Wertheimer’s view an illustration of a gestalt organization in reading. 

Although Wertheimer did not elaborate any further on this organization, and his ob- 
servations remained quite general, he clearly alluded to the difficulty that would con- 
front a theory of perceptual processes in reading stated in terms of (mechanical) unit 
identification and conversion. On the one hand, typical silent reading (in orthograph- 
ic writing systems) can neither be characterized as literal (not all letters are identified) 
nor as purely holistic (letters still matter for reading, and not all words are overtly rec- 
ognized). On the other hand, many letters and words are typically left unidentified in 
the course of proficient reading 18 . It is thus patent that the metaphors of unit (whether 
letters or whole words) identification and conversion, which fed the century-long de- 
bate between the proponents of letter-by-letter or direct whole word recognition in 
reading, are unenlightening 19 . For the contents of the perceptual experience that un- 
derlies reading are not ‘out there’ on a sheet of paper waiting to be detected and in- 
ternalized (in the form of mental representation). Since, on the one hand, the contents 
of experience in reading are not ‘out there’ awaiting identification, internalization or 
conversion, and, on the other hand, a text represents a highly elaborated yet very compact 
material for experience, reading may serve as a living small-scale model of immediate 
experience illustrating the various issues relevant to microgenesis reviewed 
earlier in this essay. 

An interesting source of insights into reading comes from so-called ‘deep dyslexia’, a 
reading pathology due to brain damage where patients are characteristically unable to 
identify letters and read aloud pronounceable nonwords. They preserve nevertheless the 
ability to read words, though to a variable degree: nouns, verbs and adjectives are read 
best whereas conjunctions and articles are seldom read aloud. In reading content words, 
they quite often make semantic paralexias (e.g. reading ‘priest’ for ‘church’) and some- 
times, but rather infrequently, visual (e.g. ‘deep’ for ‘deer’) or derivational (e.g. ‘regis- 
tered’ for ‘register’) errors (Coltheart, Patterson, & Marshall, 1980). One thing that is 
striking about observations on deep dyslexia is that they illustrate a condition in which 
perceptual-morphological, syntactic and semantic aspects of reading are all interwoven. 
Note, for instance, that in order to read ‘priest’ for ‘church’, words that visually have al- 
most nothing in common, some perceptual-morphological processing of the printed 
word ‘church’ is necessary. This processing must, however, be insufficient for the overt 
identification of the target word, yet it must be sufficient to hit upon the sphere of meaning 
relevant to ‘church’ so as to allow the patient to respond using the word ‘priest’. How 
could this occur were meaning and form alien to one another? On the other hand, it 
should also be borne in mind that, were the target a function word, chances are that a 
deep dyslexic patient would not be able to read it out loud. But how can he know that 
the target is a function word in so far as he is unable to overtly identify it? Clearly, in 
this example, form, meaning and function cannot be independent, nor can they mobilize processes 
that are intrinsically alien to one another. 

The above example is also remarkably reminiscent of observations described by Wern- 
er (1956) and Conrad (1954) in which pathological behavior due to brain damage was 
presented as an arrest of the microgenetic process at an early stage of development, 
thereby letting unfinished ‘products’ that would normally undergo further development 
occur in patients’ behavior. Moreover, an examination of patients’ semantic ‘errors’ produced 
for the same target word, whether in the same reading session or in different sessions, 
shows the same character of instability, sphere-like deployments and shifts of 
‘center of gravity’ as those described by Werner with respect to Aktualgenese experi- 
ments conducted with normal subjects. 

In a series of experiments undertaken recently in my laboratory we sought to further 
evaluate the postulates of immediate categorization and meaningfulness, selectivity, and 
global-to-local structure of microgenetic development in reading. These experiments 
were mainly intended to probe the structure of visual processes in reading but since read- 
ing normally applies to meaningful texts, other issues related to text interpretation, grad- 
ual development of meaning, and meaning and form relationship arose as well. In par- 
ticular, we sought to evaluate the general hypothesis of a global-to-local structure of vi- 
sual processes in reading by picking a specific instantiation of this hypothesis in terms 
of the selective processing of component letters depending on their orthographically discriminative 
character. The basic idea underlying this was the following: the selective 
processing of component letters that depends on their orthographically discriminative 
character presupposes prior identification of letter slots which are ambiguous and thus 
need discrimination. However, in order to determine which letter slots are ambiguous, it 
is first necessary to gather at least some ‘knowledge’ about the class of word shapes to 
which the target word belongs. Only a global word-shape-based process can bring about 
such ‘knowledge’. Moreover, the very existence of letter processing after a prior word- 
shape preview implies that the latter global process is too coarse-grained or otherwise in- 
sufficient for word identification 20 . In this sense, the idea of the selective processing of 
component letters depending on their orthographic discriminativity lets us evaluate the 
underlying hypothesis of the global-to-local structure of perceptual differentiation in 
reading. 

A few words of clarification may be needed here. What defines the ambiguous or 
critical letter slots in a word is the existence of its orthographic shapemates, i.e. oth- 
er words sharing the same global shape (global word-form irrespective of internal let- 
ter features) but which differ locally with respect to component letters that occupy 
these slots. Of course, the letters in question are of similar stature (ascender to ascen- 
der, descender to descender, etc...); otherwise the words would not share the same 
global shape or be similar in regard to global word-forms. Because the primary glob- 
al process is assumed to be coarse-grained, orthographic similarity is defined by the 
similarity of global word-forms at the level of spatial resolution which is insensitive 
to internal letter features. It is only in a later phase, and if the reader seeks local dis- 
crimination, that the visual system becomes sensitive to fine-level internal features. 
Thus, for instance, the fifth letter slot (r) in the French word ‘effarer’ is ambiguous due 
to the existence of another word ‘effacer’ which shares with the former the same glob- 
al shape (it is its shapemate) and the two words therefore can only be distinguished 
from one another by checking locally the fifth letter slot (r vs. c). On the other hand, 
there is no ambiguous letter slot in the French word ‘migraine’ because no other 
French word shares its global shape. It can then be said that the fifth letter ‘r’ in ‘ef- 
farer’ is discriminative and hence critical for the identification of this word, whereas 
the ‘r’ in ‘migraine’ is noncritical because ‘migraine’ has no shapemates from which it 
would have to be distinguished. 
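
The notion of orthographic shapemates and critical letter slots can be made concrete with a small computational sketch. The Python fragment below is only an illustration under simplifying assumptions and is not the procedure used in the experiments: global word shape is approximated by coding each letter as ascender, descender or x-height; the assignment of letters to these classes (e.g. counting ‘t’ as an ascender), the function names and the three-word lexicon are all hypothetical choices made for the example.

```python
# Illustrative assumption: coarse stature classes for unaccented lowercase letters.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")


def shape_code(word):
    """Code each letter as A (ascender), D (descender) or x (x-height)."""
    return "".join(
        "A" if ch in ASCENDERS else "D" if ch in DESCENDERS else "x"
        for ch in word.lower()
    )


def shapemates(word, lexicon):
    """Other words of the lexicon that share the coarse shape code of `word`."""
    code = shape_code(word)
    return sorted(w for w in lexicon if w != word and shape_code(w) == code)


def critical_slots(word, lexicon):
    """1-based letter positions at which `word` differs from some shapemate."""
    slots = set()
    for mate in shapemates(word, lexicon):
        for position, (a, b) in enumerate(zip(word, mate), start=1):
            if a != b:
                slots.add(position)
    return sorted(slots)


# Toy lexicon built from the examples in the text (hypothetical, not a real lexicon).
lexicon = ["effarer", "effacer", "migraine"]
print(shapemates("effarer", lexicon))       # ['effacer']
print(critical_slots("effarer", lexicon))   # [5]
print(critical_slots("migraine", lexicon))  # []
```

With this toy lexicon the fifth slot of ‘effarer’ comes out as critical, whereas ‘migraine’ has no critical slot. A realistic application would of course require a full lexicon and a better-motivated model of coarse word shape.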

The experiments which were conducted in order to evaluate the hypothesis of the se- 
lective processing of component letters were based on the letter cancellation technique 
or on the analyses of eye movements. 

The letter cancellation technique (Corcoran, 1966; Healy, 1994) requires subjects to 
cross out or circle each instance of a specific letter while reading a text for comprehension. It 
has been used to study the issue of perceptual units in reading (Drewnowski & Healy, 
1977; Hadley & Healy, 1991; Healy, 1976) in relation to the effect of linguistic function 
(e.g. content vs. function words) on letter detectability (Greenberg & Koriat, 1991; Kori- 
at & Greenberg, 1991; Koriat & Greenberg, 1994), and in relation to the phonological sta- 
tus (e.g. pronounced vs. silent) of component letters (Corcoran, 1966). Studies based on 
this technique have shown that subjects always miss a certain number of target letters 
while reading real text and that the rate of omission depends on certain parameters. For 
instance, letters in function words are more often missed than letters in content words 
(Greenberg & Koriat, 1991), silent letters remain undetected more often than pronounced 
letters (Corcoran, 1966) and word meaning may also influence letter detection (Moravc- 
sik & Healy, 1995). Since the misdetections of target letters reported in these studies are 
quite systematic and take place in spite of efforts to detect all instances of these letters, it 
is assumed that they are indicative of the characteristics of the reading process. 

In all experiments based on letter cancellation, the critical comparison was obtained by 
contrasting words in which the substitution of the target letter (most often ‘s’ and V) cre- 
ates at least one orthographic shapemate, and words where no substitution of the target 
letter makes an existing word. The main prediction of these experiments was that be- 
cause discriminative letter slots are likely to be targeted for local verification and there- 
by come to the center of local attention, the detection rate of target letters, which are crit- 
ical to shapemate word differentiation, should be substantially higher than that of non- 
critical target letters, or, alternatively, that subjects should miss many more non-critical 
targets than critical ones. The main result of these experiments was that component let- 
ters that differentiate orthographic shapemates are better detected than letters that are in 
unambiguous slots (which give rise to twice as many detection errors). This critical-let- 
ter effect was obtained on five different passages of prose, as well as on meaningless 
scrambled assemblies of words. Moreover, it was found to criterially depend on ortho- 
graphic similarity: the effect did not occur when orthographically legal letter substitu- 
tions altered word shape (e.g. ‘merite’ vs. ‘medite’). These findings unequivocally cor- 
roborate the general idea that local letter-level analyses are pretuned by an earlier glob- 
al process and thus they lend support to the hypothesis of the global-to-local structure of 
perceptual differentiation in reading 21 . 
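
The logic of the critical comparison can be sketched in a few lines of Python. This is a hedged illustration and not the analysis actually carried out: a target letter is treated as critical when replacing it by another letter of the same stature yields a word of the lexicon, and omission rates are then tallied separately for critical and noncritical targets. The stature classes, the function names, the restriction to unaccented lowercase letters, and above all the records in `records` are invented for the example and carry no empirical value.

```python
import string

# Illustrative assumption: coarse stature classes for unaccented lowercase letters.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")


def stature(ch):
    """Return A (ascender), D (descender) or x (x-height) for a letter."""
    if ch in ASCENDERS:
        return "A"
    if ch in DESCENDERS:
        return "D"
    return "x"


def is_critical_target(word, index, lexicon):
    """True if swapping word[index] for a same-stature letter gives a lexicon word."""
    original = word[index]
    for ch in string.ascii_lowercase:
        if ch == original or stature(ch) != stature(original):
            continue
        if word[:index] + ch + word[index + 1:] in lexicon:
            return True
    return False


def omission_rates(records, lexicon):
    """records: (word, 0-based index of target letter, detected) tuples."""
    tallies = {"critical": [0, 0], "noncritical": [0, 0]}  # [missed, total]
    for word, index, detected in records:
        key = "critical" if is_critical_target(word, index, lexicon) else "noncritical"
        tallies[key][1] += 1
        if not detected:
            tallies[key][0] += 1
    return {k: (missed / total if total else None)
            for k, (missed, total) in tallies.items()}


# Invented toy records, for illustration only (not experimental data).
lexicon = {"effarer", "effacer", "migraine"}
records = [("effarer", 4, True), ("effarer", 4, True),
           ("migraine", 3, True), ("migraine", 3, False)]
rates = omission_rates(records, lexicon)
print(rates)
```

Note that, under this coding, a substitution that alters word shape does not make a slot critical: `is_critical_target("merite", 2, {"merite", "medite"})` is false because ‘r’ and ‘d’ differ in stature, which mirrors the control condition described above.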

Experiments based on the analyses of the eye fixations of subjects reading various 
types of text provided independent evidence of the above critical-letter effect and 
brought additional insights into the structure of visual and interpretive processes in 
reading that are relevant here. First, the results substantiated the hypothesis of the or- 
thographic determinants of fixation locations in words by showing a systematic rela- 
tionship between the distribution of fixation locations and the presence or absence of 
orthographically discriminative letters: eye fixations tended to land on the area of dis- 
criminative letters in words that have orthographic shapemates and to spread over the 
body of words with unambiguous shapes. Second, these results showed that the pres- 
ence or absence of orthographically discriminative information does not affect the 
probability of fixating a (content) word: readers fixated words that have 
orthographic shapemates just as often as words that have unambiguous shapes. Third, the results 
showed that while reading normal two-page texts, subjects centrally fixated only 44% 
of the words. 

The finding that an orthographically ambiguous word (i.e. a word having shapemates) 
will not necessarily be fixated shows that the ‘decision’ of whether to fixate a word is 
not governed merely by orthographic considerations (e.g. the search to explicitly identify 
words) but by the ongoing process of text comprehension (see also Balota, Pollatsek, 
& Rayner, 1985; Ehrlich & Rayner, 1981; Rayner & Well, 1996). This conclusion, 
along with the consideration that less than 50% of words were centrally fixated in our 
experiments, suggests that parafoveally-gained information may be sufficiently meaningful 
for it to take part in text interpretation. This is consistent with the proposition 
that meaning goes hand-in-hand with perceptual categorization, developing along the 
same lines from general to specific, from relatively vague and global to articulate, pre- 
cise, and local. 

The finding that readers fixate not only words that have ambiguous shapes but also 
words that are unambiguous - whether because they have orthographically unique 
shapes and/or due to strong contextual evidence - suggests that definite word identifi- 
cation does not take place in parafoveal vision. We may thus ask the following ques- 
tion: if only words gaining foveal fixation are explicitly identified, how is it that sub- 
jects can skip more than 50% of words and still properly understand a text, as is shown 
by their ability to correctly answer questions about the content of what they have read? 
It is noteworthy that in our experiments this skipping very often concerned content 
words - one out of three 7-10 letter words (mainly content words) were not fixated by 
our subjects - and cannot therefore be attributed to a word class effect (e.g. certain 
function words being selectively skipped because they are highly predictable on syn- 
tactic grounds). Clearly, these results indicate that text comprehension in reading does 
not require explicit identification of all component words. This is not to suggest that the 
meaning of words that are not explicitly identified is simply ignored. Although 
parafoveal inspection does not allow for explicit word identification it does appear to 
feed the ongoing text comprehension process with adequate information (see also Lav- 
igne, Vitu, & d’Ydewalle, 2000). This information may only be partial or incomplete 
from the point of view of a dictionary definition of word meaning, but it nevertheless 
appears to be contextually appropriate and sufficient for the comprehension of a given 
text 22 . In any case, if parafoveal inspection can both constrain word discrimination and 
inform the process of text comprehension, there are grounds in psychology for the con- 
cept of immediate coarse-grained categorization of (printed) forms that is directly 
meaning-laden (due in part to a form/meaning relationship). This is precisely what mi- 
crogenetic theory stipulates. 

The foregoing example is admittedly no more than a partial illustration of the applica- 
tion of microgenetic thinking to the research context of reading. Beyond its relevance for 
a theory of visual processes in reading, the primary purpose of this illustration was to 
show that the microgenetic theory offers a viable and productive research strategy. The 
subversive quality of the theory, which even on partial and fairly local application forces
a deep revision of the field of reading research, shows that microgenesis is not a mere
collection of local hypotheses but a coherent, though as yet emergent, framework
for the study of lived experience. After all, the issue at stake is a genetic phenomenological
science of embodied cognition.

Prof. Victor Rosenthal 
Centre Paul Broca
2ter rue d’Alesia
75014, Paris, France

Notes

1 It should be underscored that microgenesis shared with the Berlin School of Gestalt theory some of its basic
tenets (e.g. the concept of field, the idea of stabilization in a dynamic system) and its phenomenological
orientation. However, it proceeded in directions neglected by the gestaltists: it focused on fine-grained tempo- 
ral dynamics of psychological processes and on the categorial character of meaning and perception; postulat- 
ed that perceptual experience is directly meaning-laden and intrinsically emotional, that forms are inherently 
semantic, and not merely morphological constructs. In contrast to the Berlin group, early work on microgene- 
sis was highly concerned with language and language development, and with cognitive disorders due to brain 
damage. 

2 Action should be distinguished from mechanical reaction. What characterizes genuine action is that it im- 
plies the autonomy and spontaneity of an agent, and a knowledge of the environment in which the action will 
take place. Indeed, the very possibility of autonomous and spontaneous action is ipso facto a demonstration of 
the agent’s knowledge of the environment. To put it in terms of a phenomenology of action (and indeed also
of living): doing is a basic form of knowing (see Arendt, 1958; Whitehead, 1983).

3 The popular flowchart models (see e.g. Shallice, 1987; Shallice, 1988) where in order to acquire meaning, a 
semantically vacuous categorical percept has to access so-called ‘semantic memory’, and where various ‘semantic
effects’ are dealt with by invoking the concept of ‘level of activation’, do not provide a better solution
to this problem. For, if semantics postdates morphology in the course of perception, and the latter is independ- 
ent of the former, no room is left for the influence of meaning and value upon the size of perceived objects. 

4 In line with Gestalt tradition, the microgenetic theory assumes that perception generically instantiates the 
structure of cognition. 

5 Correlatively, it thus becomes understandable why cognitive and perceptual processes are not infallible. Al- 
though microgenesis is globally adequate for our conditions of living, its anticipatory and directly categorial 
character conditions its potential failures. Accordingly, the observation that cognitive, perceptual or language 
processes are intrinsically fallible becomes a source of insights into the structure of cognition (see Rosenthal 
& Bisiacchi, 1997). For instance, the obstinate resistance of ‘perceptual errors’ to contradictory evidence hand- 
ily illustrates the ‘cost’ of the anticipatory and directly categorial character of microgenetic differentiation. 

6 Genetic refers here to the developmental dynamics of a process, not to a genome or to an adjectival use of 
the metaphor ‘genetic program’. 

7 It goes without saying that it is not the real line that can formally represent autochronic time. Self-genera- 
tion of time can only occur by fits and starts (or by pulsing) with variable periodicity. 

8 We are concerned here with the tentative explanation of the dynamics of processes in themselves. The read- 
er should, however, be aware that the general proposal bears on the dual structure of autochronic time. 

9 We may sometimes have an elusive, fading impression of intermediate deployments which nevertheless es- 
capes thematization however much we strive to bring it to conscious inspection. 

10 For instance, Werner noted that colors are experienced not only in terms of hue, brightness, and saturation 
but also in terms of being strong or weak, cool or warm; lines not only have extent and curvature, etc., but may 
be seen as gay or sad...

11 Note that physiognomic perception further instantiates the value-laden character of perceptual experience 
which I discussed in the initial sections of this essay. The perceptual world is indeed directly invested with val- 
ues by virtue of the same dynamic principles that confer ‘interiority’ on perceived objects and dynamic config- 
urations and urge perceivers on to action. Accordingly, values are not indirectly associated with objects on the 
basis of past experience and/or rational evaluation (though of course in certain particular situations an object 
may be valued on the basis of rational evaluation) any more than expressive qualities are inferred by analogy. 

12 (Shanon, 1982). 

13 This idea is even presented unquestioned in recent handbooks (see e.g. Palmer, 1999). The logical diffi- 
culties with which this ‘mosaic theory’ is confronted are hardly mentioned. 

14 The magnocellular system appears to exert control on eye movement. 

15 Although the M cells are often described as detectors of movement, it should be borne in mind that spatial 
and temporal discontinuities induced by movement are the very condition for form perception. Indeed, self-in- 
duced (eye and/or head) movements are necessary for seeing, and the ‘static retina’, i.e. when eye movements 
are prevented or artificially compensated for, is blind (see Yarbus, 1967). Incidentally, this latter observation 
was anticipated by Husserl and Merleau-Ponty. 

16 The global character is obvious since the figure cannot be constructed from its components. The dynamics
can be explained by the co-presence (or co-occurrence in time) of inducers (fragments) and their joint displacement
upon self-induced movement (e.g. eye-movement).

17 Note also that although subjective figures are often illustrated by geometrical forms, geometric regularity
is unnecessary, and any sensible figure, even irregular, can arise under similar conditions. The phenomenal 
completion is thus not an effect of Euclidean principles encoded in the brain. 

18 Rosenthal, Parisse, and Chainay (2002) showed that subjects skip (i.e. do not fixate in central vision) more
than 50% of words while reading regular texts. 

19 It bears noting that the so-called interactive solution (viz. interactive recognition of letters and whole 
words, see e.g. McClelland & Rumelhart, 1981; Rumelhart & McClelland, 1982) is in this respect no solution 
at all as it presupposes an a priori segmentation into relevant units. 

20 The foregoing formulation should not be interpreted literally as suggesting a two-stage (first global, then
local) theory of visual processes in reading, which, let it be said in passing, would be at odds with the micro- 
genetic theory of gradual development. It simply intends to instantiate the idea of the temporal precedence of 
global coarse-grained over local selective and fine-grained differentiation. 

21 One may notice, on the other hand, that the necessity of differentiating words having the same global shape 
presupposes the prior occurrence of a process that categorizes words on the basis of their global shape. In this 
sense, these results corroborate the proposition that perceptual differentiation involves immediate categorization. 

22 Since overt identification of all words is not necessary for contextually appropriate text comprehension, the 
proportion of words being explicitly identified may vary depending on strategic attentional factors, the type of 
text being read, and the individual’s interests and reading skills (see also the concept of the effective visual 
field in Marcel, 1974, and the discussion of the use of context).

LANGUAGE, SPACE AND THE THEORY OF SEMANTIC FORMS



1. INTRODUCTION 

Phenomenological and Gestalt perspectives have become increasingly important in
linguistics, which should lead to better exchanges with semiotics and cognitive sciences.
Cognitive linguistics, and to a certain extent what is known as linguistique de
l’énonciation, have led the way 1 . They have each in their own way established something
of a Kantian schematism at the center of their theoretical perspective, developing
on this basis what we might call a theory of semantic forms. They have introduced
genuine semantic topological spaces, and attempted to describe the dynamics of the in- 
stantiation and transformation of the linguistic schemes they postulate. It is thus pos- 
sible, up to a certain point, to conceive the construction of meaning as a construction 
of forms, and in so doing, to analyze resemblances and differences between these var- 
ious processes. As a result, the idea of grammar itself has been modified, and centered 
upon a universal linguistic schematism, which supposedly organizes the values of all 
units and constructions. At the same time a certain understanding of the phenomenon 
of polysemy has been obtained, at least as far as this grammatical level is concerned. 

However, a closer analysis reveals a number of difficulties, which call for a better under- 
standing of what a genuine phenomenological and Gestalt framework should be in seman- 
tics. First, if we agree with the fact that there is a privileged relation, or some kind of sim- 
ilar organization, between language and perception, we should make more precise the gen- 
eral theory of perception (and jointly of action!) which we take as a reference. Secondly, if 
we also agree with the idea of a specifically linguistic schematism, analogous to, but different
from, what is needed for ‘external’ perception-and-action, its realm of dimensions should 
be determined: but we note here that there is a real, important disagreement between the au- 
thors. Thirdly, if we view language activity as a construction of genuine, ‘internal’ seman- 
tic forms based on linguistic schemes, it is obvious that polysemic words should correspond 
to transposable and plastic schemes: but the works we have just evoked remain very vague 
on this point; most of the time they propose lists of cases rather than genuine transposition 
and/or transformation processes. As a matter of fact, very few authors consider polysemy 
as a fundamental property of language which should be taken into account by linguistics 
from the very beginning. 

Furthermore, all these approaches acknowledge the importance of the spatial and/or 
physical uses of linguistic units, i.e. those uses which seem to be exclusively dedicated 
to qualify the topological, geometrical or physical structure of the tangible world. But 
now a question arises: what is the relationship between these uses, and all the other us- 
es of the same units, which, depending on the context, can signify a great variety of 
meanings? For instance, what is the ‘logic’ connecting the different uses of the English 
preposition ON, as in book on the table (spatial use), departure on Monday (temporal),
tax on income or to count on one’s friends (‘support’ or ‘foundation’)? Should we consider
that the spatial or physical values of ON are in a sense a basis for all the others?
Are they more typical? Or should we put all the uses on the same footing, and derive the 
various meanings from a single more generic principle? 

In this paper we will show how to escape these false dilemmas, and how to better as- 
sess the continuity between the perception of the tangible world, and the perception of 
the Semantic Forms upon which we intend to build a theory. Starting from the key question
of prepositions and of the relation between their spatial and less- or non-spatial uses,
we shall try to put forth general semantic principles, applicable to all categories of
words and constructions (section 2). After that (section 3), we shall come back very 
briefly to Gestalt and phenomenological theories of perception, stressing the fact that 
they are semiotic theories, and not only morphological or ‘configurational’ theories of 
perception. As an immediate application to semantics, we will show the interest of this 
kind of approach to clarify the meaning of other categories of polysemic words (e.g. 
nouns). We shall then propose (section 4) - but in a very sketchy way - some general 
postulates for a microgenetic theory of Semantic Forms, based upon the mathematical 
notion of instability. The theory postulates 3 layers of meaning (or ‘phases’ of stabi- 
lization), called motifs, profiles, and themes. Taken together, they shape linguistic struc- 
ture and semantic activity. They apply in exactly the same way in lexical as well as in 
grammatical semantics. Actually, they are conceived in the perspective of being inte- 
grated more tightly into a global textual semantics, very akin to the one developed by 
F. Rastier (1987, 1989, 1994, 2000). Finally, we come back in conclusion to what 
should be the nature and place of grammar in a theory of Semantic Forms. 

This paper motivates and sketches a theory of Semantic Forms, which is a joint work 
with P. Cadiot, arising from our common interest in semantics, Gestalt theory, phenomenology,
and complex dynamical models (e.g. Visetti 1994, 2001; see also Rosenthal
and Visetti 1999, 2003). Examples and their specific analyses - sometimes slightly
reformulated - have been taken from P. Cadiot’s previous works. We propose here a syn- 
thesis of several previous publications, with a special stress on the relation between lan- 
guage and space, and on the grammatical dimensions of meaning. The semantics of 
prepositions, and more generally grammatical semantics, should be considered as a very 
important starting point, and a first application of our theory. However our real purpose 
is much more global, and goes beyond that: we try to put from the very beginning - at 
least at a theoretical level - the whole semantics under the pressure of a fully dynami- 
cal, discursive, and diachronic perspective. The interested reader will find a much more 
detailed presentation in our recent book (Cadiot & Visetti 2001) 2 .



2. FROM SCHEMES TO MOTIFS: THE CASE OF PREPOSITIONS 



All the different trends in Cognitive Linguistics have placed the question of grammar 
in the foreground of their works, and have developed specific and original conceptions 
of it. As a matter of fact, they have severely criticized the autonomy of syntax postulated
by generative linguistics in the line of Chomsky’s work. But they have maintained a
clear-cut separation between structure and content: ‘structure’ refers to a central and universal
schematic level of meaning, called grammatical, which extends to all units and
constructions; ‘content’ refers to all the remaining dimensions (concepts, notions, do- 
mains...), specifically brought by the lexicon. Grammar is therefore a kind of imagery, 
a way of structuring, of giving ‘configurations’ to all semantic domains, and also to the 
‘scenes’ evoked by speech. Imagery includes:

• structural organization of ‘scenes’ (space, time, movement, figure/ground or 
target/landmark organization, separation between entities and processes) 

• perspective (point of view, ways of going over the scene) 

• distribution of attention (focusing, stressing) 

• and, for Talmy or Vandeloise (not for Langacker), some less configurational dimen- 
sions, like the system of forces, or dimensions like control, or access. 

For all these authors, this kind of schematism is specific to language (e.g. topological, 
not metric), but has many common properties with perception of external space. 

Most often there is a trend towards relying on a very general psychological prototype, 
according to which language, at its most fundamental level, encodes tangible and/or 
physical structures. Therefore, in order to describe all kinds of categories of words, lin- 
guistics should favor spatial and/or concrete uses, and even take them as a primary ba- 
sis for all the other ones. This idea leads in cognitive semantics, and also in grammati- 
calization theories, to a hierarchy of meanings, which starts from spatial or physical val- 
ues, taken as literal meanings, up to temporal or abstract meanings, which are supposed 
to be derived from the previous ones by some kind of metaphorical transfer process. 
However, authors like Lakoff, Langacker, Talmy or Vandeloise underline that these primary
values proceed from specifically linguistic schemes, which should not be confused
with perceptive ‘external’ structures: indeed they are far more schematic, and at the same
time genuinely linguistic, since for example they shape space by introducing ‘fictive’
contours or ‘fictive’ motions (Talmy). But in spite of these very important additions, the
primacy (and/or the prototypical status) of a certain kind of spatial and physical meanings
is not really questioned. Furthermore, schematical relations between language and
perception often rely on a very peculiar conception of the spatial and physical experi- 
ence, which fails to appreciate the true nature of what the phenomenological tradition 
names the ‘immediate experience’ of subjects. It amounts to a reduction of this ‘imme- 
diate experience’ to a purely external space, and to a purely externalized physics of 
‘forces’, both separated from their motor, intentional and intersubjective (even maybe 
social and cultural) sources. In this external space, language would identify relations be- 
tween ‘trajectors’ and ‘landmarks’, conceived as independent, separate, individuals or 
places, entirely pre-existing the relations they enter into.

We think that this type of analysis extends to semantics a very questionable conception 
of perception, which stems from ontological prejudices, and not from rigorous descrip- 
tions. As a consequence of this wrong starting point, some works in the field of gram- 
mar retain only a very poor and abstract schematism; while others, or even sometimes 
the same works, address only the spatial or physical uses, hoping that the gap thus created
between these uses and all the others will be filled by an appeal to the magical notion
of metaphor.

More precisely, concerning the type of those linguistic schemes currently postulated by 
Cognitive Linguistics, and their relation to our external, everyday perception, two main attitudes can be
distinguished: 

- sometimes (Langacker, particularly) the realm of dimensions prescribed by the his- 
torical Kantian framework is centered on purely abstract ‘configurational’ dimensions 
(abstract topology, abstract dynamics); those dimensions are supposed to be a permanent 
and obligatory basis of language in all semantic domains; on the contrary, dimensions 
like ‘forces’ (and a fortiori dimensions like interiority, animacy, agency,...) are consid- 
ered as less grammatical, secondary dimensions, coming only from more or less proto- 
typical uses (e.g. referring to the external perceived space); they can only add themselves 
to the configurational dimensions, and never ‘neutralize’ them 

- sometimes the realm of dimensions is not reduced (Talmy, Vandeloise); but this realm 
is considered primarily as part of our experience of the external physical world; spatial 
uses are more than typical, they are the primary ones; and all other uses are considered 
to be derived by a kind of metaphorical process 3 . 

With the semantics of prepositions, we find in a particularly striking form the problem 
of the relation to space and to the physical world. We shall take this example as a fun- 
damental illustration of the ideas we intend to put forth in this paper. Indeed, the ap- 
proach we advocate is deeply different from those we have just evoked 4 . It aims at go- 
ing beyond these kinds of schematism, while keeping some of their ‘good’ properties. 
The exact abstraction level as well as the interior diversity of each scheme are a first key 
matter. On the one hand, abstract topological and/or kinematic characterizations (call
them ‘configurational’) are too poor. On the other hand, schemes weighted from the be- 
ginning by spatial or physical values are too specific, and furthermore rely on a very pe- 
culiar conception of spatial and physical experience. Actually, more ‘intentional’ or 
‘praxeologic’ dimensions, intuitively related to ‘interiority’, ‘animacy’, ‘expressive- 
ness’, ‘appropriation’, ‘control’, ‘dependence’, ‘anticipation’ etc. are needed. By enter- 
ing in the process of discourse, all these dimensions - configurational or not - can be 
neatly put forward by speech, or alternately kept inside the dynamics of the construction 
of meaning as a more or less virtual aspect of what is thematized. In particular, config- 
urational or morphological values are not a systematic basis: they may be pushed in the 
background, or even disappear, superseded by others, which are quite equally funda- 
mental and grammatical. 

More generally, these motifs, as we shall call them from now on, to distinguish them
definitely from the problematics we criticize, appear deformed, reshaped, in various pro- 
files, abstract as well as concrete. A motif is a unifying principle for this diversity of us- 
es, which can only be understood if one takes into account from the very beginning di- 
mensions of meaning which cannot be integrated into the narrow frame of a schematism 
- at least if by a ‘schematism’ we mean something (still predominant in cognitive linguistics)
which can be traced to Kantian philosophy (Kant [1781-1787]; for a discussion
on this point, cf. Salanskis, 1994). Of course we have to consider all these fundamental
dimensions at a very generic level, so as to assume that they are systematically put into
play, and worked out by each use. But generic as they may be, our thesis is that these di- 
mensions can be traced back to the immediate experience of perception, action and ex- 
pression, if they are conveniently described in their social and cultural setting. This is 
why we decided to drop the designation of scheme, and to adopt the word motif to ex- 
press the kind of ‘germ of meaning’ we wish to attribute to many linguistic units. Indeed, 
the word ‘scheme’ evokes a certain immanentism or innatism, a restricted repertoire of
categories not constituted by culture and social practices, and a privilege granted to a
certain biased representation of the physical world. It is therefore a term not suitable for 
indicating an historical, cultural, ‘transactional’ unifying linguistic principle, whose 
function is to motivate the variety of uses of a grammatical or a lexical unit. 



SOME SKETCHY CONSIDERATIONS ON FRENCH PREPOSITIONS 5 

There are great differences in the systems of prepositions in French and English, espe- 
cially concerning so-called ‘colourless’ or only weakly depictable ‘space prepositions’ 
like EN or PAR. We will here present only short considerations about SUR, SOUS, 
CONTRE, EN, PAR, which evidently call for considerable developments, and should be
systematically compared with other languages. We hope at least that this will be
understood as a way of challenging the routine frozen expression: “spatial preposition”. 

The case of SUR 

A very sketchy analysis allows us to distinguish the following configurations. 

A ‘region SUR’ constructed at the level of predication ÊTRE SUR (‘to be on’), i.e. a
construction of a site based on the connection [Preposition + Nominal], localization of 
the noun subject, and the contact enabled by the predicate: 

(1) Le livre est sur la table (‘The book is on the table’)

In other cases, the ‘region ON’ is established by the context of the sentence, which allows
for an adjustment or requalification of lexical and syntactic expectations.

(2) Max s’est effondré dans le fauteuil (‘Max collapsed in the (arm)chair’).

(3) Max a posé timidement une fesse sur le fauteuil (‘Max timidly sat on the (arm)chair’).

The motif ‘contact’ is permitted and enabled by the predicate. As opposed to a table or 
a sidewalk, an armchair is not a priori an acceptable object for the predicate ÊTRE SUR
(‘to be on’). The requalification is facilitated by the specific reference. 

A zone established as a frame for what happens in the ‘region SUR’. Compared with
the previous examples, the possible fluctuations between contact and localization in- 
crease. 

(4) Les enfants jouent sur le trottoir (‘The children are playing on the sidewalk’) 

Still, there is a simple correlation between a topological notion and a univocal localization
in the thematic space.

However, this correlation is nullified, or made more complex, by many other uses with
spatial implications. It may happen that the prepositional phrase does not localize the 
subject of the sentence. 

(5) Pierre joue avec sa poupée sur la table (‘Pierre plays with his doll on the table’).

(6) Pierre a vu un chat sur le balcon (‘Pierre saw a cat on the balcony’).

Nothing indicates that the referent of ‘Pierre’ is localized by the ‘region ON’ (on the
table, on the balcony). In fact, the contrary is noticeably more likely.

The ‘region ON’ no longer has determined spatial limits at the thematic level. The following
examples are quite particular to French, in which we can hypothesize that the motif
is further developed.

(7) Pierre travaille sur Paris (‘Pierre works *on/in Paris’). 

(8) Pierre est représentant sur la région Nord (‘Pierre is a representative *on/for/in the
north’).

Here, the preposition SUR is used in the construction of “functional spaces” (zones 
specified only in the domain of the predication) and not of physical spaces, but the topo- 
logical instruction of contact is preserved. 

The motif of ‘contact’, which, based on the preceding examples, we might believe to 
be simply topological, can actually be easily requalified with new interpretative effects 
for which the spatial inferences are decreasingly concrete, proving itself to be insepa- 
rable from temporal and qualitative modulations (Dendale & De Mulder, 1997, 
whence the following examples): 

- support (weight or imminence). 

(9) Une menace planait sur la ville (‘A threat hovered on?/over the town’). 

- foundation (assessment). 

(10) Juger les gens sur l’apparence (‘To judge people on?/by their appearance’).

(11) Il fut condamné sur de faux témoignages (‘He was convicted on false testimony’).

- covering. 

(12) La couverture est sur la table (‘The tablecloth is on the table’) 

- objective (goal) 

(13) Marche sur Rome (‘March on Rome’) 

(14) Fixer un oeil sur quelque chose (‘*pose/*fix/*leave/feast one’s eyes on something’).

- visibility, immediate access (as opposed to inclusion, which would signify dependence,
interposition of a border or a screen).

(15) Il y a un trou sur ta manche (‘There is a hole *on/in your sleeve’)

Semantic cues ‘support’ and/or ‘foundation’ can be extended easily to uses that are definitely
‘non-spatial’ as in:

(16) Impôt sur le revenu (‘tax on income’)

(17) agir sur ordre (‘act on orders’)

(18) Pierre a travaillé sur cette question depuis longtemps (‘Pierre has been working
on this question for a long time’).

Or even: 

(19) Sur cette question, Pierre n’a rien à dire (‘On this issue, Pierre has nothing to say’).

Here the motif of contact is invested in a thematic zoning, which can be specified only
in the domain opened by the predication or the introductory nominal argument.

Let us also remember the temporal uses differentially specifiable, which emerge from 
the motif of contact. 

(20) Sur ce, il disparut à jamais (‘*On/after this he disappeared for ever’)

(21) Pierre est sur le départ (‘Pierre is about to leave’)

(22) Il y a eu des gelées sur le matin (‘There was a frost this morning/on the morn’ (archaic))

(23) Il faut agir sur le champ (‘One must act at once’).

In compter sur ses amis (‘to count on one’s friends’), miser sur le bon cheval (‘bet on
the right horse’), without entirely abandoning a certain value of ‘to lean on’, a modu- 
lation of the original motif, the preposition is requalified as a rectional marker. 

These examples not only invalidate purely spatial and physical explanations of SUR. 
They also weaken explanations based on abstract topological schemas, which often seem 
artificial and demand further qualifications which call into doubt their validity. Above 
all, this type of schematics does not provide operable explanations, and as a result does
not explain why only certain values and not others are called upon (by interaction with
the surrounding lexical material, as we say). What’s missing here is the possibility of rec- 
ognizing the affinity and interrelation of these different values, which we would like to 
stabilize by way of lexico-grammatical motifs. 

In this way, the topological instruction, even when purely configurational and despatialized
(i.e. conceived independently of the perceived space) seems to lag behind a richer,
more open definition-delimitation of two ‘segments’ or ‘phases’ as they are construed
during any type of contact. Compared to the image of ‘surface’ often invoked (geometri- 
cal notion), or to that of ‘height’ (Weinrich 1989), this motif of ‘contact’ would have the 
same status as that of ‘coalescence’ for EN, or of ‘means’ in the case of PAR. Beyond its
dynamic value it also offers a static characteristic which provides a border or a stabilized 
variation (localization, support) but it is fundamentally an aspectual motif, intentional in 
aim and in practice. At once a motif of exploitation and of valorisation of this contact by 
a type of immediate interaction (leaning, rebounding, perlaboration), giving the values of 
objective, imminence, achievement, effect, transition, cause and effect. Its configurational 
expression, once fully deployed, includes an axial orientation of momentum, another 
transversal orientation for the contact zone and the exteriority maintained between the 
two phases thus delimited, (if the contact zone is in fact the topological frontier of the ac- 
cess zone, it is still not appropriated as its border, but remains ‘exterior’ ). 

Localization can certainly be explained in Euclidean terms: surface, height, width, etc.
But the diversity of possible instances of localization (the rich variety of contributing el- 
ements) calls for dimensions which are more dynamic (force, figure/background) com- 
pared to the more configurational ones. In the phrase cup on the table, we might empha- 
size the importance of [bearing-weight]. In bandage on the arm, drawing on the wall, 
handle on the door, apple on the branch, ON constitutes the site as a [background],
which guarantees a [detachability] for the figure, regardless of any more objective rela- 
tions with the object/surface. 

The case of SOUS

One can uncover five ‘experiential types’ (evidently a nice example of a family resemblance
in the Wittgensteinian sense):

- low position: sous la table (‘under the table’); sous les nuages (‘under the clouds’);

- covering/protection: sous la couette (‘under the covers’); (objet enfoui) sous la neige
(‘under the snow’); sous une même rubrique (‘under/in the same rubric’);

- exposure: sous la pluie (‘*under/in the rain’); (marcher) sous la neige (‘(walk) in the
snow’); sous les regards (‘under the eyes of x’); sous les bombes (‘under fire’); sous la
menace (‘under the gun’);

- inaccessibility: sous terre (‘underground’); sous le sceau du secret (‘under the seal of
secrecy’);

- dependence on the external: sous surveillance (‘under surveillance’); sous influence
(‘under the influence’); sous la contrainte (‘under pressure’); sous garantie (‘under
warranty’); sous arrestation (‘under arrest’).

These uses involve a co-adjustment of the values selected from the NPs assigned by
the preposition, and in some cases by the introductory element (see the example of
snow). Together they evoke family resemblances of covering, protection, inaccessibility,
exposure to, and dependence upon, in varying degrees of explicitness.

Among the notions evoked above, certain seem more oriented to a topological
schematic pole (a surface constructed by the PP, which establishes an interior space based
on that boundary). The others are closer to a more “instructional” pole (Cadiot 1999), which
consists of the more dynamic values, aspectualised by a quasi-praxeological perspective
(no-exit dynamic, opening blocked) indexed on the ambivalence of the situation (covering
vs. exposed). Articulating these two poles of the boundary, which remains separate
from the interior space, is just the configurational expression of this blocking and
ambivalence. As in the case of SUR, this complex motif is diversely profiled and stabilized:
by valorization, specification, or on the contrary inhibition, retreat, aspectualization of
the different values it unifies.



The case of CONTRE 

Let’s note the following four ‘experiential types’: 

- Proximity with contact: s’appuyer contre le mur (‘leaning against a wall’).

- Opposition (conflict): être contre le mur de Berlin (‘be against the Berlin wall’);
contre toute attente (‘against all expectations’).

- Exchange: échanger sa vieille voiture contre un scooter (‘trade one’s old car for a
scooter’). 

- Proportion / comparison: vingt mauvais films contre un bon (‘20 bad films *against/ 
for one good one’). 

For CONTRE we propose a motif instituting the affinity of opposition and reconciliation
(force/counter-force, posing/opposing). This motif is sustainable, up to a certain point, in a
schematic framework, which could be capable of reflecting relational categories like
[force] in a plurality of spaces (not necessarily physical). But we insist again that this
motif-schema must be modulated and specified in accordance with plausible profiles. As a
result, values such as ‘counter-force’ or ‘dynamic coming together’ can disappear almost
completely from the profile. Even when so “virtualized”, as in sofa against the wall, they
remain as a motivation for the internal perspective or ‘aspect’ of the dynamic.

The case of EN 



We will show two points: 

- there is no clear-cut distinction between spatial and non-spatial uses or senses; 

- the specifically linguistic meaning of EN should be accessed in an immediate combination
of schematic and intentional dimensions.

Let’s have a look at the following phrases:



(1) pommier en fleurs (‘apple tree in bloom’)

(2) chien en chaleur (‘dog in heat’)

(3) femme en cheveux (‘hair-dressed woman’)

(4) propos en l’air (‘words up in the air’)



The sense of these phrases can be paraphrased by the following intuitive formulations or
characterizations: ‘globally saturated physical image’ (1), ‘invasion’ (2), ‘emblematic
access’ (3), ‘taken over from the inside/outside’ (4).

They tend to show that space is only involved at a thematic level, and in some sort of
continuous variation. The characterisations can be summed up in a single notion, or motif,
of coalescence, with no linguistically prescribed limits or ‘bornage’ (bordering), and
asymmetrically oriented toward the referent of the second NP. The image of the first NP
is, so to say, absorbed in the image of the second (fleurs, chaleur, cheveux, air).

But this motif is not only schematic or perceptual. It coalesces with a more instructional 
dimension: one has to associate the resulting image with its perspective, and with the in- 
tention through which or by which it was brought about. The scene is necessarily ani- 
mated by the process which generated it. Otherwise other prepositions like DANS (with 
its bornage instruction) or even AVEC would be more appropriate. 

More direct evidence for this rather intuitive interpretation can be drawn from other
data where space is not involved:

(5) Max est en faute (‘Max is mistaken’) / *Max est en erreur

(6) Max est en tort (‘Max is wrong’) / *Max est en raison

(7) Max est en beauté (‘Max is handsome’) / *Max est en laideur

(8) Max est en vie (‘Max is alive’) / *Max est en mort

(9) Max est en difficulté (‘Max is in difficulties’) / *Max est en facilité.

There seems to be a rather regular paradigm of such cases, where only the ‘resulting
states’ which can be associated with the intentional, subjective object-oriented path, or
purpose that brought them about can be correctly introduced by EN. For example, Max
est en vie is pragmatically possible only inasmuch as one has reasons to believe that he
could be dead (after some accident, presumably); Max est en faute, because he has done
or said something which happened to be wrong or inappropriate; Max est en beauté
means more than Max est beau: that he tried or, at least, wished to be handsome...

The case of PAR 

Even more evidently, it is impossible to differentiate spatial and non-spatial uses in the
case of PAR.



être emporté par le courant           ‘to get carried away by the current’
passer par le jardin                  ‘to go through the garden’
prendre par la gauche                 ‘to take a lefthand turn’
regarder par le trou de la serrure    ‘to look through the key-hole’
attraper par la cravate               ‘to grab by the tie’
tuer par balle                        ‘to kill by bullets’
convaincre par son comportement       ‘to convince by one’s behaviour’
impressionner par son intelligence    ‘to impress by/with one’s intelligence’
passer par des moments difficiles     ‘to come through hard times’
renoncer par lassitude                ‘to give up from/because of lassitude’.



In English, BY works better with active referents and tends to internalize them in the 
scope of the schema, while with more external complements, THROUGH or even BE- 
CAUSE OF are better, and WITH seems at least to initiate a motion of externalization, 
or ‘parallelization’. As is well known, PAR is typically used to express agentivity in 
passive constructions or in any type of constructions where a process is described from 
the point of view of its activation. So it expresses an inner activation principle. Being 
‘inner’ corresponds to the schematic dimension, being ‘agentive’ to the intentional one. 
But both are intimately correlated and coactive in every instance, even when it corre- 
sponds to no specific local thematic or referential intuition. 

We end this series of examples here, and now try to draw some general conclusions.
What, actually, is our own perspective? In summary, we advocate:

• No privilege for spatial or physical usage of words (as conceived by current trends in 
Cognitive Linguistics), and consequently no doctrine of metaphorical transfer of mean- 
ing, going from the spatial and/or physical uses towards more ‘abstract’ ones (as cur- 
rently conceived by the same linguistics) 

• Search for grammatical motifs, which are ways of giving/apprehending/displaying,
immediately available in all semantic domains, without any analogical or metaphorical 
transfer stemming from more specific values, allegedly conceived as the primitive ones 

• Rejection (most of the time) of purely configurational versions of those motifs: on the
contrary, a motif, especially a grammatical one, is an unstable, and at the same time a
strongly unitized, means of building and accessing ‘semantic forms’; it ties together, and
defines a kind of transaction between, many dimensions which cannot be dissociated at
its level, but at the level of profiling inside specific semantic domains

• Rejection of an ‘immanentist’ explanation of the variety of uses, based upon an
identification of the motif with some kind of ‘autonomous’ potential; indeed, depending
on the specific use, some dimensions of the motif can be further specified, enriched
with other dimensions, or on the contrary virtualized, even completely neutralized. The
parameters controlling the profiling dynamics are not an internal property of the motif.
The relation between the motif and a particular profile has to be considered as a linguistic
motivation, because profiling a motif consists of recovering it within other dynamics,
brought about by the co-text and the context, i.e. by an ongoing hermeneutic
perspective

• A conception of the grammatical motifs (e.g. a motif of a preposition) as highly unstable
‘forms’ (or germs of forms) which can be stabilized only by interaction with the
other constituents of surrounding syntagms, or even by more distant elements of the co-text:
as we have said, this stabilization is not a ‘simple’ instantiation of the motif, but a
recapture by other non-immanent dynamics giving rise to the variety of its profiles.

Actually, this approach is very general, and applies both to grammar and to lexicon. It
is markedly different from other approaches currently worked out by cognitive linguistics.
We have already underlined some differences in the analysis of the grammatical expression
of space, and in the assessment of its status relative to the global functioning of
the units concerned. But the situation is the same for grammar as a whole, and in particular
regarding its difference from the lexical aspects of meaning. In short, we could say
that cognitive linguistics tends to limit semantics to grammar, and grammar to a certain
kind of ‘schemes’. We have just criticized their schematism, as well as the conception of
perception to which it is correlated. Indeed, concerning the type of the grammatical
schemes, and their relation to our external, everyday perception, we have seen that two
main attitudes can be distinguished:

• sometimes, the schemes are from the very beginning merged with a very peculiar 
conception of the physical world, in which the fundamental role of action, and of other 
kinds of anticipations, is underestimated (cf. Talmy, or Vandeloise 1991); 

• sometimes they are abstract, and purely topological/configurational (Langacker). 

The reason for this false alternative is simple: there is no generic diagrammatic representation
of action, animacy, interiority, expressivity, intentionality and anticipation, as
they are constituted by their cognitive, social, cultural and... linguistic modalities. As a
result, whenever one tries to take some of these dimensions into account, the only way to
recover some expression of them is to resort to physical experience, which is at the
same time wrongly apprehended. Once again, such a conception of our ‘immediate experience’
not only impoverishes the theory of grammar, it also introduces
a gap between grammar and lexicon, as well as between the so-called literal
meaning and the figurative ones. Finally, so to speak, the only relation between grammar
and lexicon is... schematism! And the only relation between the registered basic
lexicon and the variety of uses is... a metaphoric relation to space! In short, we think
that cognitive linguistics has up to now too strongly dissociated ‘structure’ (identified




